In AWS re:Invent In 2023, we are announcing general availability of knowledge bases for amazon Bedrock. With Knowledge Bases for amazon Bedrock, you can securely connect base models (FM) in amazon Bedrock to your enterprise data using a fully managed recovery augmented generation (RAG) model.
For RAG-based applications, the accuracy of the responses generated by FMs depends on the context provided to the model. Contexts are retrieved from vector stores based on user queries. In the newly released feature for amazon Bedrock knowledge bases, hybrid search, you can combine semantic search with keyword search. However, in many situations, you may need to recover documents created in a defined period or tagged with certain categories. To refine your search results, you can filter based on document metadata to improve retrieval accuracy, which in turn leads to more relevant FM generations aligned with your interests.
In this post, we discuss the new custom metadata filtering feature in Knowledge Bases for amazon Bedrock, which you can use to improve search results by pre-filtering your retrievals from vector stores.
Metadata Filtering Overview
Prior to the launch of metadata filtering, all semantically relevant fragments up to the preset maximum would be returned as context for the FM to use to generate a response. Now, with metadata filters, you can retrieve not only semantically relevant fragments, but also a well-defined subset of those relevant fragments based on the applied metadata filters and associated values.
With this feature, you can now provide a custom metadata file (each up to 10 KB) for each knowledge base document. You can apply filters to your retrievals, telling the vector store to pre-filter based on document metadata and then search for relevant documents. This way, you will have control over the recovered documents, especially if your queries are ambiguous. For example, you can use legal documents with similar terms for different contexts, or movies that have a similar plot released in different years. Additionally, reducing the number of fragments searched provides performance benefits such as a reduction in CPU cycles and the cost of querying the vector store, as well as an improvement in accuracy.
To use the metadata filtering feature, you must provide metadata files along with the source data files with the same name as the source data file and .metadata.json
suffix. Metadata can be strings, numbers, or booleans. The following is an example of the content of the metadata file:
The amazon Bedrock knowledge base metadata filtering feature is available in the US East (N. Virginia) and US West (Oregon) AWS regions.
The following are common use cases for metadata filtering:
- Document chatbot for a software company – This allows users to find product information and troubleshooting guides. Filters on operating system or application version, for example, can help avoid retrieving obsolete or irrelevant documents.
- Conversational search for an organization's application. – This allows users to search for documents, kanbans, meeting recording transcripts, and other assets. By using metadata filters on workgroups, business units, or project IDs, you can personalize the chat experience and improve collaboration. An example would be “What is the status of the Sphinx project and the risks raised?”, where users can filter documents for a specific project or source type (such as email or meeting documents).
- Smart search for software developers – This allows developers to search for information on a specific version. Filters based on release version and document type (such as code, API reference, or issue) can help identify relevant documents.
Solution Overview
In the following sections, we demonstrate how to prepare a data set to use as a knowledge base and then perform queries with metadata filtering. You can query using the AWS Management Console or the SDK.
Prepare a dataset for amazon Bedrock knowledge bases
For this post, we use a sample data set about fictional video games to illustrate how to ingest and retrieve metadata using amazon Bedrock knowledge bases. If you want to follow it in your own AWS account, download the file.
If you want to add metadata to your documents in an existing knowledge base, create the metadata files with the expected file name and schema, then go to the step to synchronize your data with the knowledge base to start incremental ingestion.
In our sample data set, each game document is a separate CSV file (e.g. s3://$bucket_name/video_game/$game_id.csv
) with the following columns:
title
, description
, genres
, year
, publisher
, score
The metadata for each game has the suffix .metadata.json
(For example, s3://$bucket_name/video_game/$game_id.csv.metadata.json
) with the following scheme:
Create a knowledge base for amazon Bedrock
For instructions on creating a new knowledge base, see Create a knowledge base. For this example, we use the following configuration:
- About him Configure data source page, below fragmentation strategyselect no fragmentationbecause you already pre-processed the documents in the previous step.
- In it Embedding model section, choose Titan G1 Inlays – Text.
- In it Vector database section, choose Quickly create a new vector store. The metadata filtering feature is available for all supported vector stores.
Synchronize the dataset with the knowledge base
After the knowledge base is created and its data files and metadata files are in an amazon Simple Storage Service (amazon S3) bucket, you can start incremental ingestion. For instructions, see Synchronize to incorporate your data sources into the knowledge base.
Query with metadata filtering in the amazon Bedrock console
To use the metadata filtering options in the amazon Bedrock console, complete the following steps:
- In the amazon Bedrock console, choose Knowledge bases in the navigation panel.
- Choose the knowledge base you created.
- Choose Test knowledge base.
- Choose the Settings icon, then expand Filters.
- Enter a condition using the format: key = value (e.g. genres = Strategy) and press Get into.
- To change the key, value, or operator, choose the condition.
- Continue with the remaining conditions (e.g. (genres = Strategy AND year >= 2023) OR (rating >= 9))
- When you finish, enter your inquiry in the message box, then choose Run.
For this post, we entered the query “A strategy game with cool graphics released after 2023.”
Query with metadata filtering using the SDK
To use the SDK, first create the client for the Agents for amazon Bedrock runtime:
Then build the filter (the following are some examples):
Pass the filter to retrievalConfiguration
from the Retrieve API or RetrieveAndGenerate API:
The following table lists some answers with different metadata filtering conditions.
Consultation | Metadata filtering | Recovered documents | Observations |
“A strategy game with great graphics released after 2023” | Off |
* Viking Saga: The Sea Raider, year: 2023, genres: Strategy * Medieval Castle: Siege and Conquest, year:2022genres: Strategy *Cybernetic Revolution: The Rise of the Machines, year:2022genres: Strategy |
2/5 games meet the condition (genres = Strategy and year >= 2023) |
In | * Viking Saga: The Sea Raider, year: 2023, genres: Strategy * Fantasy Kingdoms: Chronicles of Eldoria, year: 2023, genres: Strategy |
2/2 games meet the condition (genres = Strategy and year >= 2023) |
In addition to custom metadata, you can also filter using S3 prefixes (which are built-in metadata, so you don't need to provide any metadata files). For example, if you organize game documents into prefixes by publisher (for example, s3://$bucket_name/video_game/$publisher/$game_id.csv
), you can filter with the specific editor (e.g. neo_tokyo_games
) using the following syntax:
Clean
To clean up your resources, complete the following steps:
- Delete the knowledge base:
- In the amazon Bedrock console, choose Knowledge bases low Orchestration in the navigation panel.
- Choose the knowledge base you created.
- Take note of the name of the AWS Identity and Access Management (IAM) service role in the Knowledge Base Overview section.
- In it Vector database section, take note of the collection's ARN.
- Choose Deleteand then enter delete to confirm.
- Delete the vector database:
- In the amazon OpenSearch Service console, choose Collections low Serverless in the navigation panel.
- Enter the ARN of the collection you saved in the search bar.
- Select the collection and choose Delete.
- Enter confirm at the confirmation prompt, then choose Delete.
- Remove the role from the IAM service:
- In the IAM console, choose Roles in the navigation panel.
- Find the name of the role you noted earlier.
- Select the role and choose Delete.
- Enter the role name in the confirmation message and delete the role.
- Delete the sample data set:
- In the amazon S3 console, navigate to the S3 bucket you used.
- Select the prefix and files, then choose Delete.
- Enter delete permanently at the confirmation prompt to delete.
Conclusion
In this post, we cover the metadata filtering feature in amazon Bedrock knowledge bases. You learned how to add custom metadata to documents and use it as filters while you retrieve and query documents using the amazon Bedrock console and SDK. This helps improve context accuracy, making query responses even more relevant, and at the same time achieves a reduction in the cost of querying the vector database.
For additional resources, see the following:
About the authors
Corvus Lee is a Senior Solutions Architect at GenAI Labs based in London. He is passionate about designing and developing prototypes that use generative ai to solve customer problems. He also stays up to date with the latest advances in generative ai and recovery techniques by applying them to real-world scenarios.
Ahmed Ewis is a Senior Solutions Architect at AWS GenAI Labs, helping customers prototype generative ai to solve business problems. When he is not collaborating with clients, he likes to play with his children and cook.
Chris Pecora is a Generative ai Data Scientist at amazon Web Services. He is passionate about creating innovative products and solutions while focusing on customer-obsessed science. When he's not performing experiments and keeping up with the latest developments in GenAI, he loves spending time with his children.