Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, through a single API. To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from enterprise data sources and enriches the prompt to deliver more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval to prompt augmentation. However, descriptive information about your data, called metadata, is often needed at retrieval time. Without metadata, the retrieval process can return unrelated results, which decreases FM accuracy and increases cost through wasted FM prompt tokens.
On March 27, 2024, Amazon Bedrock announced a key new feature called metadata filtering and also changed the default engine. This feature allows you to use metadata fields during the retrieval process. However, metadata fields need to be configured during the knowledge base ingestion process. Often, you may have tabular data where details about one field are available in another field. You may also need to cite the exact text document or text field to avoid hallucinations. In this post, we show you how to use the new metadata filtering feature with Knowledge Bases for Amazon Bedrock for such tabular data.
Solution Overview
The solution consists of the following high-level steps:
- Prepare data for metadata filtering.
- Create and incorporate data and metadata into the knowledge base.
- Retrieve data from the knowledge base by filtering metadata.
Prepare data for metadata filtering
At the time of writing, Knowledge Bases for Amazon Bedrock supports Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise Cloud, and MongoDB Atlas as underlying vector store providers. In this post, we create and access an OpenSearch Serverless vector store using the Amazon Bedrock Boto3 SDK. For more details, see Set up a vector index for your knowledge base in a supported vector store.
For this post, we created a knowledge base using the public dataset Food.com – Recipes and Reviews. The following screenshot shows an example of the dataset.
The TotalTime field is in ISO 8601 duration format. You can convert it to minutes using the following logic:
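A minimal sketch of that conversion, assuming the TotalTime values in this dataset only contain hour and minute components (for example, PT30M or PT1H30M):

```python
import re

def iso8601_to_minutes(duration: str) -> int:
    """Convert an ISO 8601 duration such as 'PT1H30M' to total minutes.

    Assumes only hour and minute components appear, which holds for the
    TotalTime column in this dataset; other components raise an error.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

print(iso8601_to_minutes("PT1H30M"))  # 90
```

You can then apply this function to the TotalTime column (for example, with pandas) to produce the TotalTimeInMinutes feature used later.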
After converting some of the features like CholesterolContent, SugarContent, and RecipeInstructions, the data frame looks like the following screenshot.
To allow the FM to point to a specific recipe with a link (cite the document), we split each row of the tabular data into a single text file, with RecipeInstructions as the data field and TotalTimeInMinutes, CholesterolContent, and SugarContent as metadata. The metadata must be saved in a separate JSON file that has the same name as the data file with .metadata.json appended. For example, if the data file is named 100.txt, the metadata file must be named 100.txt.metadata.json.
For more details, see Add metadata to your files to enable filtering. Additionally, the content of the metadata file must be in the following format:
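As a sketch, writing one such metadata file with Python's json module might look like the following; the metadataAttributes wrapper is the structure the service expects for the attribute values, and the specific numbers here are illustrative:

```python
import json

# Metadata for data file 100.txt, stored as 100.txt.metadata.json
# (illustrative values for one recipe row)
metadata = {
    "metadataAttributes": {
        "TotalTimeInMinutes": 30,
        "CholesterolContent": 0.0,
        "SugarContent": 2.0,
    }
}

with open("100.txt.metadata.json", "w") as f:
    json.dump(metadata, f)
```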
For simplicity, we only process the top 2000 rows to create the knowledge base.
- After importing the required libraries, create a local directory using the following Python code:
- Iterate over the top 2000 rows to create data and metadata files to store in the local folder:
- Create an Amazon Simple Storage Service (Amazon S3) bucket named food-kb and upload the files:
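Putting these steps together, a minimal sketch might look like the following. The small DataFrame here is a stand-in for the cleaned Food.com rows, and the docs directory and food-kb bucket name are assumptions carried over from the steps above:

```python
import json
import os

import pandas as pd

# Stand-in for the prepared Food.com DataFrame; in the real workflow this
# would hold the top 2000 cleaned rows of the dataset
df = pd.DataFrame(
    {
        "RecipeInstructions": ["Mix and bake.", "Boil, then simmer."],
        "TotalTimeInMinutes": [27, 20],
        "CholesterolContent": [0.0, 0.0],
        "SugarContent": [2.0, 1.5],
    }
)

data_dir = "docs"
os.makedirs(data_dir, exist_ok=True)

for idx, row in df.iterrows():
    # Data file: the recipe text the FM will retrieve and cite
    with open(os.path.join(data_dir, f"{idx}.txt"), "w") as f:
        f.write(row["RecipeInstructions"])
    # Matching .metadata.json file used for filtering at retrieval time
    metadata = {
        "metadataAttributes": {
            "TotalTimeInMinutes": int(row["TotalTimeInMinutes"]),
            "CholesterolContent": float(row["CholesterolContent"]),
            "SugarContent": float(row["SugarContent"]),
        }
    }
    with open(os.path.join(data_dir, f"{idx}.txt.metadata.json"), "w") as f:
        json.dump(metadata, f)

# The files can then be uploaded to the S3 bucket, for example:
#   boto3.client("s3").upload_file(local_path, "food-kb", file_name)
```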
Create and incorporate data and metadata into the knowledge base
Once the S3 folder is ready, you can create the knowledge base using the Amazon Bedrock SDK, following this example notebook.
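At a high level, the SDK call builds two configurations: one naming the embedding model and one describing the vector store. The following sketch shows the shape of the request; all ARNs, names, and field mappings are placeholders, not values from this post:

```python
# Shape of the create_knowledge_base request (placeholder ARNs and names)
kb_config = {
    "type": "VECTOR",
    "vectorKnowledgeBaseConfiguration": {
        "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1"
    },
}
storage_config = {
    "type": "OPENSEARCH_SERVERLESS",
    "opensearchServerlessConfiguration": {
        "collectionArn": "arn:aws:aoss:us-east-1:111122223333:collection/EXAMPLE",
        "vectorIndexName": "food-kb-index",
        "fieldMapping": {
            "vectorField": "vector",
            "textField": "text",
            "metadataField": "metadata",
        },
    },
}

# With valid ARNs and an IAM role, the knowledge base would be created via
# the bedrock-agent client:
#   bedrock_agent = boto3.client("bedrock-agent")
#   bedrock_agent.create_knowledge_base(
#       name="food-kb",
#       roleArn="arn:aws:iam::111122223333:role/BedrockKBRole",
#       knowledgeBaseConfiguration=kb_config,
#       storageConfiguration=storage_config,
#   )
```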
Retrieve data from the knowledge base by filtering metadata
Now, let's retrieve some data from the knowledge base. For this post, we use Anthropic Claude Sonnet on Amazon Bedrock as our FM, but you can choose from a variety of Amazon Bedrock models. First, you need to set the following variables, where kb_id is the ID of your knowledge base. The knowledge base ID can be found programmatically, as shown in the example notebook, or from the Amazon Bedrock console by navigating to the individual knowledge base, as shown in the following screenshot.
Set the required amazon Bedrock parameters using the following code:
The following output shows the result of retrieving from the knowledge base without metadata filtering for the query "Tell me a recipe that I can make in less than 30 minutes and has a cholesterol below 10". As we can see, the preparation durations of the two retrieved recipes are 30 and 480 minutes, and their cholesterol contents are 86 and 112.4, respectively. Therefore, the retrieval does not follow the query precisely.
The following code demonstrates how to use the retrieval API with metadata filters set to cholesterol content less than 10 and preparation minutes less than 30 for the same query:
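A sketch of that call, assuming the bedrock_agent_runtime client and kb_id variables from the setup step; the filter combines two conditions with andAll:

```python
# Metadata filter: cholesterol content < 10 AND preparation time < 30 minutes
metadata_filter = {
    "andAll": [
        {"lessThan": {"key": "CholesterolContent", "value": 10}},
        {"lessThan": {"key": "TotalTimeInMinutes", "value": 30}},
    ]
}

query = (
    "Tell me a recipe that I can make in less than 30 minutes "
    "and has a cholesterol below 10"
)

# The filter goes into the vector search configuration of the retrieve call:
#   response = bedrock_agent_runtime.retrieve(
#       knowledgeBaseId=kb_id,
#       retrievalQuery={"text": query},
#       retrievalConfiguration={
#           "vectorSearchConfiguration": {
#               "numberOfResults": 2,
#               "filter": metadata_filter,
#           }
#       },
#   )
```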
As we can see from the following results, for the two recipes, the preparation times are 27 and 20, respectively, and the cholesterol contents are 0 and 0, respectively. By using metadata filtering, we get more accurate results.
The following code shows how to get accurate output using the same metadata filtering with the retrieve_and_generate API. First, we configure the prompt, and then we configure the API call with metadata filtering:
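A sketch of that configuration, reusing the same filter; the knowledge base ID is a placeholder, and the model ARN assumes Claude 3 Sonnet in us-east-1:

```python
# Same filter as before, now applied inside retrieve_and_generate so the FM
# generates its answer only from recipes that satisfy the metadata constraints
metadata_filter = {
    "andAll": [
        {"lessThan": {"key": "CholesterolContent", "value": 10}},
        {"lessThan": {"key": "TotalTimeInMinutes", "value": 30}},
    ]
}

rag_config = {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "XXXXXXXXXX",  # placeholder: your kb_id
        "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"filter": metadata_filter}
        },
    },
}

# With valid IDs, the call and generated answer would be:
#   response = bedrock_agent_runtime.retrieve_and_generate(
#       input={"text": query},
#       retrieveAndGenerateConfiguration=rag_config,
#   )
#   print(response["output"]["text"])
```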
As we can see in the following output, the model returns a detailed recipe that follows the indicated metadata filtering of less than 30 minutes preparation time and a cholesterol content less than 10.
Clean up
Make sure to comment out the following section if you plan to use the knowledge base you created to build your RAG application. If you just want to test building the knowledge base using the SDK, make sure to delete all the resources that were created because you will incur costs for storing documents in the OpenSearch Serverless index. See the following code:
Conclusion
In this post, we explain how to split a large tabular dataset into rows to set up a knowledge base with metadata for each of those records and how to retrieve the results with metadata filtering. We also show how retrieving results with metadata is more accurate than retrieving results without metadata filtering. Finally, we show how to use the result with an FM to get accurate results.
To further explore the capabilities of Knowledge Bases for Amazon Bedrock, see the following resources:
About the Author
Tanay Chowdhury is a Data Scientist at the Amazon Web Services Generative AI Innovation Center. He helps customers solve their business problems using generative AI and machine learning.