Now that the collections are finally populated with vectors, we can start querying the database. There are many ways to query it, but I find two kinds of input especially useful:
- An input text
- An input vector ID
3.1 Query vectors with an input text
Let's say we built this vector database to power a search engine. In this case, we expect the user to enter text, and we have to return the most relevant items.
Since all operations on a vector database are performed with… VECTORS, we first need to transform the text entered by the user into a vector so that we can find similar items based on that input. Remember that we used Sentence Transformers to encode textual data into embeddings, so we can use the same encoder to generate a numerical representation of the user's text.
Since NPR contains news articles, let's say the user typed "Donald Trump" to learn more about the US elections:
query_text = "Donald Trump"
query_vector = encoder.encode(query_text).tolist()
print(query_vector)
# output: [-0.048, -0.120, 0.695, ...]
Once the query vector is computed, we can find the closest vectors in the collection and choose which payload fields to return for them, such as newsId, title, and topics:
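from qdrant_client.models import Filter
from qdrant_client.http import models

# Find the nearest neighbors of the query vector and return only the listed payload fields
client.search(
    collection_name="news-articles",
    query_vector=query_vector,
    with_payload=["newsId", "title", "topics"],
    query_filter=None,  # no filtering for now
)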
Note: By default, Qdrant uses Approximate Nearest Neighbors to find embeddings quickly, but you can also do a full scan and retrieve exactly the closest neighbors (see the Search API section of the Qdrant documentation, documentation/concepts/search/#search-api). Just keep in mind that this is a much more expensive operation.
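To make the cost of a full scan concrete, here is a minimal pure-Python sketch of what an exact search does conceptually: it scores every stored vector against the query and keeps the best ones. The helper names (`cosine_similarity`, `exact_search`), the toy vectors, and the choice of cosine similarity as the metric are all assumptions made for illustration; this is not Qdrant's implementation. In the Python client, an exact search is, to my knowledge, requested by passing `search_params=models.SearchParams(exact=True)` to the search call.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query, vectors, limit=3):
    """Score EVERY stored vector against the query (a full scan)
    and return the top `limit` hits as (id, score) pairs."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}

top_hits = exact_search([1.0, 0.0, 0.0], toy_vectors, limit=2)
print([vec_id for vec_id, _ in top_hits])
```

Because every vector is compared with the query, the work grows linearly with the size of the collection, which is exactly what an approximate index avoids.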
Running the search for "Donald Trump" returns the following titles (translated into English for better understanding):
- Input sentence: Donald Trump
- Output 1: Paraguayans will go to the polls this Sunday (30th) to elect a new president
- Output 2: Voters say Biden and Trump should not run in 2024, Reuters/Ipsos poll shows
- Output 3: Writer accuses Trump of sexually abusing her in the '90s
- Output 4: Mike Pence, former vice president of Donald Trump, gives testimony before the court that could complicate the former president
It seems that, in addition to bringing back news about Trump himself, the embedding model also managed to capture topics related to presidential elections. Notice that the first result contains no direct reference to the input term "Donald Trump"; its only connection is the presidential election theme.
Furthermore, I left the query_filter parameter unset. It is a very useful tool when the output must satisfy a given condition. For example, on a news portal it is often important to return only the most recent articles (say, from the last 7 days). In that case, you can restrict the query to news articles whose publication timestamp is above a minimum value.
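As a rough illustration of the 7-day condition, the sketch below computes the cutoff timestamp and builds the filter as a plain dictionary in the JSON shape used by Qdrant's REST API (`must`, `key`, `range`, `gte`). The payload field name `publishedAt`, the assumption that it stores Unix seconds, and the helper `recent_articles_filter` are hypothetical and not taken from the collection used in this article. With the Python client, the equivalent objects would be `Filter`, `FieldCondition`, and `Range` from `qdrant_client.models`.

```python
from datetime import datetime, timedelta, timezone


def recent_articles_filter(now, days=7, field="publishedAt"):
    """Build a filter keeping only points whose timestamp field
    is no older than `days` days before `now`.

    `field` is a hypothetical payload key assumed to hold Unix seconds.
    """
    cutoff = int((now - timedelta(days=days)).timestamp())
    return {"must": [{"key": field, "range": {"gte": cutoff}}]}


# A fixed reference time keeps the example reproducible;
# in practice you would use datetime.now(timezone.utc).
now = datetime(2023, 5, 8, tzinfo=timezone.utc)
news_filter = recent_articles_filter(now)
print(news_filter)
```

The resulting condition would take the place of `query_filter=None` in the search call, so that only sufficiently recent articles are considered as candidates.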
Note: In the context of news recommendation, there are several important aspects to consider, such as fairness and diversity. This is an open topic for discussion, but if you are interested in this area, take a look at the articles from the STANDARDIZATION Workshop.
3.2 Query vectors with an input vector ID
Finally, we can ask the vector database to "recommend" items that are close to some desired vector IDs but far from undesired vector IDs. The desired and undesired IDs are called positive and negative examples, respectively, and they act as seeds for the recommendation.
For example, let's say we have the following positive ID:
seed_id = '8bc22460-532c-449b-ad71-28dd86790ca2'
# title (translated): 'Learn why Joe Biden launched his bid for re-election this Tuesday'
We can then request elements similar to this example:
client.recommend(
    collection_name="news-articles",
    positive=[seed_id],  # IDs the results should be close to
    negative=None,       # no IDs to move away from
    with_payload=["newsId", "title", "topics"],
)
After running this operation, here are the translated output titles:
- Input item: Learn why Joe Biden launched his bid for re-election this Tuesday
- Output 1: Biden announces that he will run for re-election
- Output 2: United States: the 4 reasons that led Biden to run for re-election
- Output 3: Voters say Biden and Trump should not run in 2024, Reuters/Ipsos poll shows
- Output 4: The gaffe by Biden's advisor that raised doubts about a possible second government after the elections
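To build some intuition for how positive and negative seeds can steer a recommendation, here is a minimal sketch of one way to combine them into a single query vector: average the positives, then push the result away from the average of the negatives. This follows my understanding of the "average vector" strategy described in the Qdrant documentation (avg_positive + avg_positive - avg_negative), so treat the formula as an assumption rather than a description of the actual implementation. The helper names and toy vectors are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def recommendation_query(positive, negative=None):
    """Combine seed vectors into one query vector.

    Without negatives: the average of the positives.
    With negatives: avg_positive + (avg_positive - avg_negative),
    which moves the query away from the negative examples.
    """
    avg_pos = mean_vector(positive)
    if not negative:
        return avg_pos
    avg_neg = mean_vector(negative)
    return [2 * p - n for p, n in zip(avg_pos, avg_neg)]


# Toy 2-dimensional seed vectors standing in for article embeddings.
liked = [[1.0, 0.0], [0.0, 1.0]]
disliked = [[0.0, 1.0]]

print(recommendation_query(liked))
print(recommendation_query(liked, disliked))
```

The combined vector can then be searched like any other query vector, which is why a recommendation request can cost roughly the same as a regular search.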