Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, through a single API. To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from enterprise data sources and enriches the prompt to deliver more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval to prompt augmentation. However, descriptive information about your data, called metadata, is often needed at retrieval time. Without metadata, the retrieval process can return unrelated results, which decreases FM accuracy and increases cost through wasted FM prompt tokens.
On March 27, 2024, Amazon Bedrock announced a key new feature called metadata filtering and also changed the default engine. This feature allows you to use metadata fields during the retrieval process. However, metadata fields need to be configured during the knowledge base ingestion process. Often, you may have tabular data where details about one field are available in another field. You may also need to cite the exact text document or text field to avoid hallucinations. In this post, we show you how to use the new metadata filtering feature with Knowledge Bases for Amazon Bedrock for such tabular data.
Solution Overview
The solution consists of the following high-level steps:
- Prepare data for metadata filtering.
- Create and incorporate data and metadata into the knowledge base.
- Retrieve data from the knowledge base by filtering metadata.
Prepare data for metadata filtering
At the time of writing, Knowledge Bases for Amazon Bedrock supports Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise Cloud, and MongoDB Atlas as underlying vector store providers. In this post, we create and access an OpenSearch Serverless vector store using the Amazon Bedrock Boto3 SDK. For more details, see Set up a vector index for your knowledge base in a supported vector store.
For this post, we created a knowledge base using the public dataset Food.com – Recipes and Reviews. The following screenshot shows an example of the dataset.
The TotalTime field is in ISO 8601 duration format. You can convert it to minutes using the following logic:
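A minimal sketch of that conversion, assuming the TotalTime values in this dataset only contain hour and minute components (for example, PT30M or PT1H30M):

```python
import re

def iso8601_to_minutes(duration: str) -> int:
    """Convert an ISO 8601 duration such as 'PT1H30M' to total minutes.

    Assumes only hour and minute components appear, which holds for the
    TotalTime column in this dataset; other components raise an error.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

print(iso8601_to_minutes("PT1H30M"))  # 90
```

You can then apply this function to the TotalTime column (for example, with pandas) to produce the TotalTimeInMinutes feature used later.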
After converting some of the features like CholesterolContent, SugarContent, and RecipeInstructions, the data frame looks like the following screenshot.
To allow the FM to point to a specific recipe with a link (cite the document), we split each row of the tabular data into a single text file, with RecipeInstructions as the data field and TotalTimeInMinutes, CholesterolContent, and SugarContent as metadata. The metadata must be saved in a separate JSON file that has the same name as the data file with .metadata.json appended. For example, if the data file is named 100.txt, the metadata file must be named 100.txt.metadata.json.
For more details, see Add metadata to your files to enable filtering. Additionally, the content of the metadata file must be in the following format:
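As a sketch, writing one such metadata file with Python's json module might look like the following; the metadataAttributes wrapper is the structure the service expects for the attribute values, and the specific numbers here are illustrative:

```python
import json

# Metadata for data file 100.txt, stored as 100.txt.metadata.json
# (illustrative values for one recipe row)
metadata = {
    "metadataAttributes": {
        "TotalTimeInMinutes": 30,
        "CholesterolContent": 0.0,
        "SugarContent": 2.0,
    }
}

with open("100.txt.metadata.json", "w") as f:
    json.dump(metadata, f)
```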
For simplicity, we only process the top 2000 rows to create the knowledge base.
- After importing the required libraries, create a local directory using the following Python code:
- Iterate over the top 2000 rows to create data and metadata files to store in the local folder:
- Create an Amazon Simple Storage Service (Amazon S3) bucket named food-kb and upload the files:
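Putting these steps together, a minimal sketch might look like the following. The small DataFrame here is a stand-in for the cleaned Food.com rows, and the docs directory and food-kb bucket name are assumptions carried over from the steps above:

```python
import json
import os

import pandas as pd

# Stand-in for the prepared Food.com DataFrame; in the real workflow this
# would hold the top 2000 cleaned rows of the dataset
df = pd.DataFrame(
    {
        "RecipeInstructions": ["Mix and bake.", "Boil, then simmer."],
        "TotalTimeInMinutes": [27, 20],
        "CholesterolContent": [0.0, 0.0],
        "SugarContent": [2.0, 1.5],
    }
)

data_dir = "docs"
os.makedirs(data_dir, exist_ok=True)

for idx, row in df.iterrows():
    # Data file: the recipe text the FM will retrieve and cite
    with open(os.path.join(data_dir, f"{idx}.txt"), "w") as f:
        f.write(row["RecipeInstructions"])
    # Matching .metadata.json file used for filtering at retrieval time
    metadata = {
        "metadataAttributes": {
            "TotalTimeInMinutes": int(row["TotalTimeInMinutes"]),
            "CholesterolContent": float(row["CholesterolContent"]),
            "SugarContent": float(row["SugarContent"]),
        }
    }
    with open(os.path.join(data_dir, f"{idx}.txt.metadata.json"), "w") as f:
        json.dump(metadata, f)

# The files can then be uploaded to the S3 bucket, for example:
#   boto3.client("s3").upload_file(local_path, "food-kb", file_name)
```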
Create and incorporate data and metadata into the knowledge base
Once the S3 folder is ready, you can create the knowledge base using the Amazon Bedrock SDK, following this example notebook.
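At a high level, the SDK call builds two configurations: one naming the embedding model and one describing the vector store. The following sketch shows the shape of the request; all ARNs, names, and field mappings are placeholders, not values from this post:

```python
# Shape of the create_knowledge_base request (placeholder ARNs and names)
kb_config = {
    "type": "VECTOR",
    "vectorKnowledgeBaseConfiguration": {
        "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1"
    },
}
storage_config = {
    "type": "OPENSEARCH_SERVERLESS",
    "opensearchServerlessConfiguration": {
        "collectionArn": "arn:aws:aoss:us-east-1:111122223333:collection/EXAMPLE",
        "vectorIndexName": "food-kb-index",
        "fieldMapping": {
            "vectorField": "vector",
            "textField": "text",
            "metadataField": "metadata",
        },
    },
}

# With valid ARNs and an IAM role, the knowledge base would be created via
# the bedrock-agent client:
#   bedrock_agent = boto3.client("bedrock-agent")
#   bedrock_agent.create_knowledge_base(
#       name="food-kb",
#       roleArn="arn:aws:iam::111122223333:role/BedrockKBRole",
#       knowledgeBaseConfiguration=kb_config,
#       storageConfiguration=storage_config,
#   )
```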
Retrieve data from the knowledge base by filtering metadata
Now, let's retrieve some data from the knowledge base. For this post, we use Anthropic Claude Sonnet on Amazon Bedrock as our FM, but you can choose from a variety of Amazon Bedrock models. First, you need to set the following variables, where kb_id is the ID of your knowledge base. The knowledge base ID can be found programmatically, as shown in the example notebook, or from the Amazon Bedrock console by navigating to the individual knowledge base, as shown in the following screenshot.
Set the required amazon Bedrock parameters using the following code:
The following output shows the result of retrieving from the knowledge base without metadata filtering for the query "Tell me a recipe that I can make in less than 30 minutes and has a cholesterol below 10". As we can see, the preparation durations of the two retrieved recipes are 30 and 480 minutes, and their cholesterol contents are 86 and 112.4, respectively. Therefore, the retrieval does not follow the query precisely.
The following code demonstrates how to use the retrieval API with metadata filters set to cholesterol content less than 10 and preparation minutes less than 30 for the same query:
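A sketch of that call, assuming the bedrock_agent_runtime client and kb_id variables from the setup step; the filter combines two conditions with andAll:

```python
# Metadata filter: cholesterol content < 10 AND preparation time < 30 minutes
metadata_filter = {
    "andAll": [
        {"lessThan": {"key": "CholesterolContent", "value": 10}},
        {"lessThan": {"key": "TotalTimeInMinutes", "value": 30}},
    ]
}

query = (
    "Tell me a recipe that I can make in less than 30 minutes "
    "and has a cholesterol below 10"
)

# The filter goes into the vector search configuration of the retrieve call:
#   response = bedrock_agent_runtime.retrieve(
#       knowledgeBaseId=kb_id,
#       retrievalQuery={"text": query},
#       retrievalConfiguration={
#           "vectorSearchConfiguration": {
#               "numberOfResults": 2,
#               "filter": metadata_filter,
#           }
#       },
#   )
```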
As we can see from the following results, for the two recipes, the preparation times are 27 and 20, respectively, and the cholesterol contents are 0 and 0, respectively. By using metadata filtering, we get more accurate results.
The following code shows how to get accurate output using the same metadata filtering with the retrieve_and_generate API. First, we configure the prompt, and then we configure the API call with metadata filtering:
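A sketch of that configuration, reusing the same filter; the knowledge base ID is a placeholder, and the model ARN assumes Claude 3 Sonnet in us-east-1:

```python
# Same filter as before, now applied inside retrieve_and_generate so the FM
# generates its answer only from recipes that satisfy the metadata constraints
metadata_filter = {
    "andAll": [
        {"lessThan": {"key": "CholesterolContent", "value": 10}},
        {"lessThan": {"key": "TotalTimeInMinutes", "value": 30}},
    ]
}

rag_config = {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "XXXXXXXXXX",  # placeholder: your kb_id
        "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"filter": metadata_filter}
        },
    },
}

# With valid IDs, the call and generated answer would be:
#   response = bedrock_agent_runtime.retrieve_and_generate(
#       input={"text": query},
#       retrieveAndGenerateConfiguration=rag_config,
#   )
#   print(response["output"]["text"])
```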
As we can see in the following output, the model returns a detailed recipe that follows the indicated metadata filtering of less than 30 minutes preparation time and a cholesterol content less than 10.
Clean up
Make sure to comment out the following section if you plan to use the knowledge base you created to build your RAG application. If you just want to test building the knowledge base using the SDK, make sure to delete all the resources that were created because you will incur costs for storing documents in the OpenSearch Serverless index. See the following code:
Conclusion
In this post, we explain how to split a large tabular dataset into rows to set up a knowledge base with metadata for each of those records and how to retrieve the results with metadata filtering. We also show how retrieving results with metadata is more accurate than retrieving results without metadata filtering. Finally, we show how to use the result with an FM to get accurate results.
To further explore the capabilities of Knowledge Bases for Amazon Bedrock, see the following resources:
About the Author
Tanay Chowdhury is a Data Scientist at the Amazon Web Services Generative AI Innovation Center. He helps customers solve their business problems using generative AI and machine learning.