The report The economic potential of generative AI: The next productivity frontier (https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#introduction), published by McKinsey & Company, estimates that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion in value to the global economy. The largest value will be added across four areas: customer operations, marketing and sales, software engineering, and R&D.
The potential for such large business value is galvanizing tens of thousands of enterprises to build their generative AI applications in AWS. However, many product managers and enterprise architect leaders want a better understanding of the costs, cost-optimization levers, and sensitivity analysis.
This post addresses these cost considerations so you can optimize your generative AI costs in AWS.
The post assumes a basic familiarity with foundation models (FMs) and large language models (LLMs), tokens, vector embeddings, and vector databases in AWS. Because Retrieval Augmented Generation (RAG) is one of the most common frameworks used in generative AI solutions, the post explains costs in the context of a RAG solution and the corresponding optimization pillars on Amazon Bedrock.
In Part 2 of this series, we will cover how to estimate business value and the influencing factors.
Cost and performance optimization pillars
Designing performant and cost-effective generative AI applications is essential for realizing the full potential of this transformative technology and driving widespread adoption within your organization.
Forecasting and managing costs and performance in generative AI applications is driven by the following optimization pillars:
- Model selection, choice, and customization – We define these as follows:
  - Model selection – This process involves identifying the optimal model that meets a wide variety of use cases, followed by model validation, where you benchmark against high-quality datasets and prompts to identify successful model contenders.
  - Model choice – This refers to the choice of an appropriate model because different models have varying pricing and performance attributes.
  - Model customization – This refers to choosing the appropriate techniques to customize the FMs with training data to optimize the performance and cost-effectiveness according to business-specific use cases.
- Token usage – Analyzing token usage consists of the following:
  - Token count – The cost of using a generative AI model depends on the number of tokens processed. This can directly impact the cost of an operation.
  - Token limits – Understanding token limits and what drives token count, and putting guardrails in place to limit token count can help you optimize token costs and performance.
  - Token caching – Caching at the application layer or LLM layer for commonly asked user questions can help reduce the token count and improve performance.
- Inference pricing plan and usage patterns – We consider two pricing options:
  - On-Demand – Ideal for most models, with charges based on the number of input/output tokens, with no guaranteed token throughput.
  - Provisioned Throughput – Ideal for workloads demanding guaranteed throughput, but with relatively higher costs.
- Miscellaneous factors – Additional factors can include:
  - Security guardrails – Applying content filters for personally identifiable information (PII), harmful content, undesirable topics, and detecting hallucinations improves the safety of your generative AI application. These filters can perform and scale independently of LLMs and have costs that are directly proportional to the number of filters and the tokens examined.
  - Vector database – The vector database is a critical component of most generative AI applications. As the amount of data usage in your generative AI application grows, vector database costs can also grow.
  - Chunking strategy – Chunking strategies such as fixed size chunking, hierarchical chunking, or semantic chunking can influence the accuracy and costs of your generative AI application.
Let’s dive deeper to examine these factors and associated cost-optimization tips.
Retrieval Augmented Generation
RAG helps an LLM answer questions specific to your corporate data, even though the LLM was never trained on your data.
As illustrated in the following diagram, the generative AI application reads your trusted corporate data sources, chunks the data, generates vector embeddings, and stores the embeddings in a vector database. The vectors and data stored in a vector database are often called a knowledge base.
The generative AI application uses the vector embeddings to search and retrieve chunks of data that are most relevant to the user’s question and augment the question to generate the LLM response. The following diagram illustrates this workflow.
The workflow consists of the following steps:
- A user asks a question using the generative AI application.
- A request to generate embeddings is sent to the LLM.
- The LLM returns embeddings to the application.
- These embeddings are searched against vector embeddings stored in a vector database (knowledge base).
- The application receives context relevant to the user question from the knowledge base.
- The application sends the user question and the context to the LLM.
- The LLM uses the context to generate an accurate and grounded response.
- The application sends the final response back to the user.
Amazon Bedrock is a fully managed service providing access to high-performing FMs from leading AI providers through a unified API. It offers a wide range of LLMs to choose from.
In the preceding workflow, the generative AI application invokes Amazon Bedrock APIs to send text to an LLM like Amazon Titan Text Embeddings V2 to generate text embeddings, and to send prompts to an LLM like Anthropic’s Claude Haiku or Meta Llama to generate a response.
The generated text embeddings are stored in a vector database such as Amazon OpenSearch Service, Amazon Relational Database Service (Amazon RDS), Amazon Aurora, or Amazon MemoryDB.
A generative AI application such as a virtual assistant or support chatbot might need to carry a conversation with users. A multi-turn conversation requires the application to store a per-user question-answer history and send it to the LLM for additional context. This question-answer history can be stored in a database such as Amazon DynamoDB.
The generative AI application could also use Amazon Bedrock Guardrails to detect off-topic questions, ground responses to the knowledge base, detect and redact PII, and detect and block hate or violence-related questions and answers.
Now that we have a good understanding of the various components in a RAG-based generative AI application, let’s explore how these factors influence costs while running your application in AWS using RAG.
Directional costs for small, medium, large, and extra large scenarios
Consider an organization that wants to help its customers with a virtual assistant that can answer their questions any time with a high degree of accuracy, performance, consistency, and safety. The performance and cost of the generative AI application depend directly on a few major factors in the environment, such as the velocity of questions per minute, the volume of questions per day (considering peak and off-peak), the amount of knowledge base data, and the LLM that is used.
Although this post explains the factors that influence costs, it can be useful to know the directional costs, based on some assumptions, to get a relative understanding of various cost components for a few scenarios such as small, medium, large, and extra large environments.
The following table is a snapshot of directional costs for four different scenarios with varying volume of user questions per month and knowledge base data.
. | SMALL | MEDIUM | LARGE | EXTRA LARGE |
Inputs | . | . | . | . |
Total questions per month | 500,000 | 2,000,000 | 5,000,000 | 7,020,000 |
Knowledge base data size in GB (actual text size on documents) | 5 | 25 | 50 | 100 |
Annual costs (directional)* | . | . | . | . |
Amazon Bedrock On-Demand costs using Anthropic’s Claude 3 Haiku | $5,785 | $23,149 | $57,725 | $81,027 |
Amazon OpenSearch Service provisioned cluster costs | $6,396 | $13,520 | $20,701 | $39,640 |
Amazon Bedrock Titan Text Embeddings V2 costs | $396 | $5,826 | $7,320 | $13,585 |
Total annual costs (directional) | $12,577 | $42,495 | $85,746 | $134,252 |
Unit cost per 1,000 questions (directional) | $2.10 | $1.80 | $1.40 | $1.60 |
*These costs are based on assumptions. Costs will vary if assumptions change. Cost estimates will vary for each customer. The data in this post should not be used as a quote and does not guarantee the cost for actual use of AWS services. The costs, limits, and models can change over time.
For the sake of brevity, we use the following assumptions:
- Amazon Bedrock On-Demand pricing model
- Anthropic’s Claude 3 Haiku LLM
- AWS Region us-east-1
- Token assumptions for each user question:
- Total input tokens to LLM = 2,571
- Output tokens from LLM = 149
- Average of four characters per token
- Total tokens = 2,720
- There are other cost components such as DynamoDB to store question-answer history, Amazon Simple Storage Service (Amazon S3) to store data, and AWS Lambda or Amazon Elastic Container Service (Amazon ECS) to invoke Amazon Bedrock APIs. However, these costs are not as significant as the cost components mentioned in the table.
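The totals and unit costs in the table can be re-derived from its own rows. The following is a minimal sketch that copies the figures from the table above; the dictionary layout and function names are illustrative assumptions, not part of any AWS tooling.

```python
# Figures copied from the directional cost table (USD per year).
SCENARIOS = {
    "small": {"questions_per_month": 500_000, "bedrock": 5_785, "opensearch": 6_396, "embeddings": 396},
    "medium": {"questions_per_month": 2_000_000, "bedrock": 23_149, "opensearch": 13_520, "embeddings": 5_826},
    "large": {"questions_per_month": 5_000_000, "bedrock": 57_725, "opensearch": 20_701, "embeddings": 7_320},
    "extra_large": {"questions_per_month": 7_020_000, "bedrock": 81_027, "opensearch": 39_640, "embeddings": 13_585},
}


def annual_total(scenario):
    """Sum the three annual cost components listed in the table."""
    return scenario["bedrock"] + scenario["opensearch"] + scenario["embeddings"]


def unit_cost_per_1000(scenario):
    """Annual cost divided by annual question volume, per 1,000 questions."""
    annual_questions = scenario["questions_per_month"] * 12
    return annual_total(scenario) / annual_questions * 1000


for name, scenario in SCENARIOS.items():
    print(f"{name:12s} total=${annual_total(scenario):,} "
          f"per 1,000 questions=${unit_cost_per_1000(scenario):.2f}")
```

The computed unit costs come out slightly different from the table's last row, which appears to be rounded to the nearest $0.10.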
We refer to this table in the remainder of this post. In the next few sections, we cover Amazon Bedrock costs and the key factors that influence them, vector embedding costs, vector database costs, and Amazon Bedrock Guardrails costs. In the final section, we cover how chunking strategies influence some of these cost components.
Amazon Bedrock costs
Amazon Bedrock has two pricing models: On-Demand (used in the preceding example scenario) and Provisioned Throughput.
With the On-Demand model, an LLM has a maximum requests (questions) per minute (RPM) and tokens per minute (TPM) limit. The RPM and TPM are typically different for each LLM. For more information, see Quotas for Amazon Bedrock.
In the extra large use case, 7 million questions per month, spread over 10 hours per day and 22 business days per month, translate to 532 questions per minute (532 RPM). This is well below the maximum limit of 1,000 RPM for Anthropic’s Claude 3 Haiku.
With 2,720 average tokens per question and 532 requests per minute, the TPM is 2,720 x 532 = 1,447,040, which is well below the maximum limit of 2,000,000 TPM for Anthropic’s Claude 3 Haiku.
However, assume that the user questions grow by 50%. The RPM, TPM, or both might cross the thresholds. In such cases, where the generative AI application needs to exceed the On-Demand RPM and TPM thresholds, you should consider the Amazon Bedrock Provisioned Throughput model.
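The RPM and TPM arithmetic above can be sketched as follows. This is a rough check that assumes questions arrive evenly across working minutes (real traffic will have peaks); the limits are the Claude 3 Haiku figures quoted in this post and can change over time, and the function names are illustrative.

```python
BUSINESS_DAYS_PER_MONTH = 22
HOURS_PER_DAY = 10
TOKENS_PER_QUESTION = 2_720  # 2,571 input + 149 output
RPM_LIMIT = 1_000            # On-Demand limit quoted in this post
TPM_LIMIT = 2_000_000        # On-Demand limit quoted in this post


def required_rate(questions_per_month):
    """Return (requests per minute, tokens per minute) for an even load."""
    active_minutes = BUSINESS_DAYS_PER_MONTH * HOURS_PER_DAY * 60
    rpm = round(questions_per_month / active_minutes)
    tpm = rpm * TOKENS_PER_QUESTION
    return rpm, tpm


def exceeds_on_demand(questions_per_month):
    """Report which On-Demand limits the load would cross."""
    rpm, tpm = required_rate(questions_per_month)
    return {"rpm": rpm > RPM_LIMIT, "tpm": tpm > TPM_LIMIT}


baseline = 7_020_000
grown = int(baseline * 1.5)  # 50% growth in questions
print(required_rate(baseline), exceeds_on_demand(baseline))
print(required_rate(grown), exceeds_on_demand(grown))
```

With these inputs, the 50% growth case stays under the RPM limit but crosses the TPM limit, which is why both thresholds need to be checked.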
With Amazon Bedrock Provisioned Throughput, cost is based on a per-model unit basis. Model units are dedicated for the duration you plan to use, such as an hourly, 1-month, or 6-month commitment.
Each model unit offers a certain capacity of maximum tokens per minute. Therefore, the number of model units (and the costs) are determined by the input and output TPM.
With Amazon Bedrock Provisioned Throughput, you incur charges per model unit whether you use it or not. Therefore, the Provisioned Throughput model is relatively more expensive than the On-Demand model.
Consider the following cost-optimization tips:
- Start with the On-Demand model and test for your performance and latency with your choice of LLM. This will deliver the lowest costs.
- If On-Demand can’t satisfy the desired volume of RPM or TPM, start with Provisioned Throughput with a 1-month subscription during your generative AI application beta period. However, for steady state production, consider a 6-month subscription to lower the Provisioned Throughput costs.
- If there are shorter peak hours and longer off-peak hours, consider using a Provisioned Throughput hourly model during the peak hours and On-Demand during the off-peak hours. This can minimize your Provisioned Throughput costs.
Factors influencing costs
In this section, we discuss various factors that can influence costs.
Number of questions
With the On-Demand model, cost grows as the number of questions grows, as shown in the following figure for annual costs (based on the table discussed earlier).
Input tokens
The main sources of input tokens to the LLM are the system prompt, user prompt, context from the vector database (knowledge base), and context from QnA history, as illustrated in the following figure.
As the size of each component grows, the number of input tokens to the LLM grows, and so do the costs.
Generally, user prompts are relatively small. For example, in the user prompt “What are the performance and cost optimization strategies for Amazon DynamoDB?”, assuming four characters per token, there are approximately 20 tokens.
System prompts can be large (and therefore the costs are higher), especially for multi-shot prompts where multiple examples are provided to get LLM responses with better tone and style. If each example in the system prompt uses 100 tokens and there are three examples, that’s 300 tokens, which is considerably larger than the actual user prompt.
Context from the knowledge base tends to be the largest. For example, when the documents are chunked and text embeddings are generated for each chunk, assume that the chunk size is 2,000 characters. Assume that the generative AI application sends three chunks relevant to the user prompt to the LLM. This is 6,000 characters. Assuming four characters per token, this translates to 1,500 tokens. This is much higher than a typical user prompt or system prompt.
Context from QnA history can also be high. Assume an average of 20 tokens in the user prompt and 100 tokens in the LLM response. Assume that the generative AI application sends a history of three question-answer pairs along with each question. This translates to (20 tokens per question + 100 tokens per response) x 3 question-answer pairs = 360 tokens.
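The four input-token sources described in this section can be added up in a few lines. The following sketch uses only the illustrative figures from the preceding paragraphs; the variable names are assumptions. Note that the resulting total is for these illustrative figures and is not the 2,571 input tokens assumed for the cost table.

```python
CHARS_PER_TOKEN = 4  # rule of thumb used throughout this post


def tokens_from_chars(n_chars):
    """Approximate token count from a character count."""
    return n_chars // CHARS_PER_TOKEN


user_prompt_tokens = 20
system_prompt_tokens = 3 * 100                    # three 100-token examples
kb_context_tokens = tokens_from_chars(3 * 2_000)  # three 2,000-character chunks
history_tokens = 3 * (20 + 100)                   # three question-answer pairs

input_tokens = {
    "user prompt": user_prompt_tokens,
    "system prompt": system_prompt_tokens,
    "knowledge base context": kb_context_tokens,
    "QnA history": history_tokens,
}
total_input_tokens = sum(input_tokens.values())

for source, count in input_tokens.items():
    print(f"{source:24s} {count:5d} tokens ({count / total_input_tokens:.0%})")
print(f"{'total':24s} {total_input_tokens:5d} tokens")
```

In this example the knowledge base context accounts for roughly two thirds of the input tokens, which is why the number and size of retrieved chunks is usually the first lever to test.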
Consider the following cost-optimization tips:
- Limit the number of characters per user prompt
- Test the accuracy of responses with various numbers of chunks and chunk sizes from the vector database before finalizing their values
- For generative AI applications that need to carry a conversation with a user, test with two, three, four, or five pairs of QnA history and then pick the optimal value
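One way to experiment with the number of QnA pairs is to keep the history in a bounded buffer so that older pairs drop off automatically. The following is a minimal in-memory sketch; the class and method names are hypothetical, and a production application would persist the history in a store such as DynamoDB rather than in process memory.

```python
from collections import deque


class ConversationHistory:
    """Keeps only the most recent question-answer pairs for one user."""

    def __init__(self, max_pairs):
        # deque with maxlen discards the oldest pair once the limit is hit
        self._pairs = deque(maxlen=max_pairs)

    def add(self, question, answer):
        self._pairs.append((question, answer))

    def pairs(self):
        return list(self._pairs)

    def as_context(self):
        """Render the retained pairs as text to prepend to the next prompt."""
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self._pairs)


history = ConversationHistory(max_pairs=3)
for i in range(5):
    history.add(f"question {i}", f"answer {i}")

print(history.as_context())  # only the three most recent pairs remain
```

Changing `max_pairs` between two and five and comparing response quality gives a direct way to find the smallest history that still carries the conversation.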
Output tokens
The response from the LLM will depend on the user prompt. In general, the pricing for output tokens is three to five times higher than the pricing for input tokens.
Consider the following cost-optimization tips:
- Because the output tokens are expensive, consider specifying the maximum response size in your system prompt
- If some users belong to a group or department that requires higher token limits on the user prompt or LLM response, consider using multiple system prompts in such a way that the generative AI application picks the right system prompt depending on the user
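To see why output tokens matter even when they are few, the following sketch computes the share of per-question cost that comes from output tokens, using the token assumptions from the cost table (2,571 input, 149 output) and the three-to-five-times price ratio mentioned above. It works in relative units, so no actual prices are assumed; the function name is illustrative.

```python
def output_cost_share(input_tokens, output_tokens, output_price_ratio):
    """Fraction of per-question cost due to output tokens.

    Costs are in relative units: one input token costs 1 and one output
    token costs `output_price_ratio`.
    """
    input_cost = input_tokens
    output_cost = output_tokens * output_price_ratio
    return output_cost / (input_cost + output_cost)


INPUT_TOKENS = 2_571
OUTPUT_TOKENS = 149

token_share = OUTPUT_TOKENS / (INPUT_TOKENS + OUTPUT_TOKENS)
print(f"output share of tokens: {token_share:.1%}")
for ratio in (3, 5):
    share = output_cost_share(INPUT_TOKENS, OUTPUT_TOKENS, ratio)
    print(f"output share of cost at {ratio}x price: {share:.1%}")
```

Under these assumptions, output tokens are a small fraction of the tokens but a noticeably larger fraction of the cost, so capping response length has an outsized effect.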
Vector embedding costs
As explained previously, in a RAG application, the data is chunked, and text embeddings are generated and stored in a vector database (knowledge base). The text embeddings are generated by invoking the Amazon Bedrock API with an LLM, such as Amazon Titan Text Embeddings V2. This is independent of the Amazon Bedrock model you choose for inferencing, such as Anthropic’s Claude Haiku or other LLMs.
The pricing to generate text embeddings is based on the number of input tokens. The more data you embed, the more input tokens are processed, and therefore the higher the costs.
For example, with 25 GB of data, assuming four characters per token, input tokens total 6,711 million. With the Amazon Bedrock On-Demand price for Amazon Titan Text Embeddings V2 at $0.02 per million tokens, the cost of generating embeddings is $134.22.
However, On-Demand has an RPM limit of 2,000 for Amazon Titan Text Embeddings V2. With 2,000 RPM, it will take 112 hours to embed 25 GB of data. Because this is a one-time job of embedding data, this might be acceptable in most scenarios.
For monthly change rate and new data of 5% (1.25 GB per month), the time required will be 6 hours.
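The embedding cost and duration figures above can be reproduced with a short calculation. This sketch takes the price and RPM limit from this post and adds three assumptions of its own that are consistent with the stated results: 1 GB is 2^30 bytes, each byte is one character, and each request embeds one 2,000-character chunk with no overlap.

```python
BYTES_PER_GB = 2**30              # assumption: binary gigabytes
CHARS_PER_TOKEN = 4
PRICE_PER_MILLION_TOKENS = 0.02   # USD, On-Demand price quoted in this post
RPM_LIMIT = 2_000                 # On-Demand limit quoted in this post
CHUNK_CHARS = 2_000               # assumption: one chunk per request


def embedding_estimate(data_gb):
    """Return (input tokens, cost in USD, hours at the RPM limit)."""
    chars = data_gb * BYTES_PER_GB
    tokens = chars / CHARS_PER_TOKEN
    cost = tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    requests = chars / CHUNK_CHARS
    hours = requests / RPM_LIMIT / 60
    return tokens, cost, hours


for gb in (25, 1.25):
    tokens, cost, hours = embedding_estimate(gb)
    print(f"{gb} GB: {tokens / 1e6:,.0f}M tokens, ${cost:,.2f}, {hours:.1f} hours")
```

The duration depends heavily on the chunk size assumption: smaller chunks mean more requests and a longer job at the same RPM limit.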
In rare situations where the volume of actual text data is very large, on the order of TBs, Provisioned Throughput will be needed to generate text embeddings. For example, to generate text embeddings for 500 GB in 3, 6, and 9 days, the approximate one-time costs using Provisioned Throughput are $60,000, $33,000, and $24,000, respectively.
Typically, the actual text inside a file is 5–10 times smaller than the file size reported by Amazon S3 or a file system. Therefore, when you see 100 GB size for all your files that need to be vectorized, there is a high probability that the actual text inside the files will be 10–20 GB.
One way to estimate the text size inside files is with the following steps:
- Pick 5–10 sample representations of the files.
- Open the files, copy the content, and enter it into a Word document.
- Use the word count feature to identify the text size.
- Calculate the ratio of this size with the file system reported size.
- Apply this ratio to the total file system to get a directional estimate of actual text size inside all the files.
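The sampling steps above amount to a simple ratio calculation. The following is a minimal sketch; the sample sizes are made-up placeholders, and the function name is illustrative.

```python
def estimate_text_size(samples, total_file_size_gb):
    """Estimate the actual text size across all files from a few samples.

    samples: list of (text_bytes, file_bytes) pairs measured on sample files.
    Returns (text-to-file ratio, estimated text size in GB).
    """
    text_bytes = sum(text for text, _ in samples)
    file_bytes = sum(size for _, size in samples)
    ratio = text_bytes / file_bytes
    return ratio, ratio * total_file_size_gb


# Hypothetical measurements from five sample files (bytes).
samples = [
    (120_000, 1_000_000),
    (300_000, 2_500_000),
    (80_000, 600_000),
    (450_000, 3_000_000),
    (50_000, 900_000),
]
ratio, text_gb = estimate_text_size(samples, total_file_size_gb=100)
print(f"text-to-file ratio: {ratio:.3f}, estimated text: {text_gb:.1f} GB")
```

Pooling the bytes across samples, rather than averaging per-file ratios, weights larger files more heavily, which matches how the total is applied to the whole file system.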
Vector database costs
AWS offers many vector databases, such as OpenSearch Service, Aurora, Amazon RDS, and MemoryDB. As explained earlier in this post, the vector database plays a critical role in grounding responses to your enterprise data whose vector embeddings are stored in a vector database.
The following are some of the factors that influence the costs of a vector database. For the sake of brevity, we consider an OpenSearch Service provisioned cluster as the vector database.
- Amount of data to be used as the knowledge base – Costs are directly proportional to data size. More data means more vectors. More vectors mean more indexes in a vector database, which in turn requires more memory and therefore higher costs. For best performance, it’s recommended to size the vector database so that all the vectors are stored in memory.
- Index compression – Vector embeddings can be indexed by HNSW or IVF algorithms. The index can also be compressed. Although compressing the indexes can reduce the memory requirements and costs, it might reduce accuracy. Therefore, consider doing extensive testing for accuracy before deciding to use compression variants of HNSW or IVF. For example, for a large text data size of 100 GB, assuming 2,000 bytes of chunk size, 15% overlap, vector dimension count of 512, a no upfront 3-year Reserved Instance, and the HNSW algorithm, the approximate costs are $37,000 per year. The corresponding costs with compression using hnsw-fp16 and hnsw-pq are $21,000 and $10,000 per year, respectively.
- Reserved Instances – Cost decreases the longer you reserve the cluster instance that stores the vector database. For example, in the preceding scenario, an On-Demand instance would cost approximately $75,000 per year, a no upfront 1-year Reserved Instance would cost $52,000 per year, and a no upfront 3-year Reserved Instance would cost $37,000 per year.
Other factors, such as the number of retrievals from the vector database that you pass as context to the LLM, can influence input tokens and therefore costs. But in general, the preceding factors are the most important cost drivers.
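To get a feel for why data size and compression drive memory, the following sketch estimates HNSW index memory for the 100 GB example. It uses the memory estimate published in the OpenSearch k-NN documentation, 1.1 x (4 x dimension + 8 x m) bytes per vector, with 2 bytes per dimension for fp16. The vector count model (one vector per chunk, with overlapping chunks advancing by 85% of the chunk size), the default m of 16, and binary gigabytes are assumptions of this sketch, and memory does not map directly to the dollar figures above.

```python
import math

BYTES_PER_GB = 2**30  # assumption: binary gigabytes


def num_vectors(text_bytes, chunk_bytes, overlap):
    """One vector per chunk; overlapping chunks advance by a shorter stride."""
    stride = chunk_bytes * (1 - overlap)
    return math.ceil(text_bytes / stride)


def hnsw_memory_bytes(n_vectors, dimension, m=16, bytes_per_dim=4):
    """OpenSearch k-NN estimate: 1.1 * (bytes_per_dim * d + 8 * m) per vector."""
    return 1.1 * (bytes_per_dim * dimension + 8 * m) * n_vectors


vectors = num_vectors(100 * BYTES_PER_GB, chunk_bytes=2_000, overlap=0.15)
full = hnsw_memory_bytes(vectors, dimension=512)
fp16 = hnsw_memory_bytes(vectors, dimension=512, bytes_per_dim=2)

print(f"vectors: {vectors:,}")
print(f"HNSW (fp32): {full / BYTES_PER_GB:.1f} GB")
print(f"HNSW (fp16): {fp16 / BYTES_PER_GB:.1f} GB")
```

Halving the bytes per dimension roughly halves the memory estimate, which is the mechanism behind the lower cluster cost for the compressed variants.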
Amazon Bedrock Guardrails
Let’s assume your generative AI virtual assistant is supposed to answer questions related to your products for your customers on your website. How will you avoid users asking off-topic questions such as science, religion, geography, politics, or puzzles? How do you avoid responding to user questions on hate, violence, or race? And how can you detect and redact PII in both questions and responses?
The Amazon Bedrock ApplyGuardrail API can help you solve these problems. Guardrails offer multiple policies such as content filters, denied topics, contextual grounding checks, and sensitive information filters (PII). You can selectively apply these filters to all or a specific portion of data such as user prompt, system prompt, knowledge base context, and LLM responses.
Applying all filters to all data will increase costs. Therefore, you should evaluate carefully which filter you want to apply on what portion of data. For example, if you want PII to be detected or redacted from the LLM response, for 2 million questions per month, approximate costs (based on output tokens mentioned earlier in this post) would be $200 per month. In addition, if your security team wants to detect or redact PII for user questions as well, the total Amazon Bedrock Guardrails costs will be $400 per month.
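The guardrail figures above can be approximated with a per-text-unit calculation. The post does not state the unit price, so this sketch assumes a PII filter price of $0.10 per 1,000 text units, with one text unit covering up to 1,000 characters; these values are chosen because they are consistent with the $200 and $400 figures, and current pricing should be checked before relying on them.

```python
import math

CHARS_PER_TOKEN = 4
CHARS_PER_TEXT_UNIT = 1_000        # assumption
PRICE_PER_1000_TEXT_UNITS = 0.10   # assumption, USD


def monthly_filter_cost(items_per_month, tokens_per_item):
    """Monthly cost of applying one filter to one portion of the data."""
    chars = tokens_per_item * CHARS_PER_TOKEN
    units_per_item = math.ceil(chars / CHARS_PER_TEXT_UNIT)
    return items_per_month * units_per_item / 1_000 * PRICE_PER_1000_TEXT_UNITS


QUESTIONS_PER_MONTH = 2_000_000
response_cost = monthly_filter_cost(QUESTIONS_PER_MONTH, tokens_per_item=149)
question_cost = monthly_filter_cost(QUESTIONS_PER_MONTH, tokens_per_item=20)

print(f"PII filter on responses: ${response_cost:,.0f} per month")
print(f"PII filter on responses and questions: ${response_cost + question_cost:,.0f} per month")
```

Because text is billed in whole units under this assumption, a short question and a 149-token response each count as one unit, so each additional filtered portion adds the same amount.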
Chunking strategies
As explained earlier in how RAG works, your data is chunked, embeddings are generated for those chunks, and the chunks and embeddings are stored in a vector database. These chunks of data are retrieved later and passed as context along with user questions to the LLM to generate a grounded and relevant response.
The following are different chunking strategies, each of which can influence costs:
- Standard chunking – In this case, you can specify default chunking, which is approximately 300 tokens, or fixed-size chunking, where you specify the token size (for example, 300 tokens) for each chunk. Larger chunks will increase input tokens and therefore costs.
- Hierarchical chunking – This strategy is useful when you want to chunk data at smaller sizes (for example, 300 tokens) but send larger pieces of chunks (for example, 1,500 tokens) to the LLM so the LLM has a bigger context to work with while generating responses. Although this can improve accuracy in some cases, this can also increase the costs because of larger chunks of data being sent to the LLM.
- Semantic chunking – This strategy is useful when you want chunking based on semantic meaning instead of just the token. In this case, a vector embedding is generated for one or three sentences. A sliding window is used to consider the next sentence and embeddings are calculated again to identify whether the next sentence is semantically similar or not. The process continues until you reach an upper limit of tokens (for example, 300 tokens) or you find a sentence that isn’t semantically similar. This boundary defines a chunk. The input token costs to the LLM will be similar to standard chunking (based on a maximum token size) but the accuracy might be better because of chunks having sentences that are semantically similar. However, this will increase the costs of generating vector embeddings because embeddings are generated for each sentence, and then for each chunk. But at the same time, these are one-time costs (and for new or changed data), which might be worth it if the accuracy is comparatively better for your data.
- Advanced parsing – This is an optional pre-step to your chunking strategy. It is used to identify chunk boundaries, which is especially useful when you have documents with a lot of complex data such as tables, images, and text. Because advanced parsing uses an FM to parse the documents, the costs will be the input and output token costs for the entire data that you want to use for vector embeddings. These costs will be high. Consider using advanced parsing only for those files that have a lot of tables and images.
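To make the link between chunk size and input tokens concrete, the following is a minimal sketch of fixed-size chunking with optional overlap, followed by the context size for three retrieved chunks. Whitespace-separated words stand in for tokens here, which is an assumption of the sketch; a real application would use the model's tokenizer, and managed knowledge bases perform chunking for you.

```python
def chunk_fixed(tokens, chunk_size, overlap=0):
    """Split a token list into fixed-size chunks with optional overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
        start += step
    return chunks


def context_tokens(retrieved_chunks, tokens_per_chunk):
    """Input tokens added to each prompt by the retrieved context."""
    return retrieved_chunks * tokens_per_chunk


document = ("word " * 1_000).split()  # stand-in for a tokenized document
chunks = chunk_fixed(document, chunk_size=300)
print([len(c) for c in chunks])

print(context_tokens(3, 300))    # standard chunks sent as context
print(context_tokens(3, 1_500))  # larger parent chunks, as in hierarchical chunking
```

Sending three 1,500-token parent chunks instead of three 300-token chunks multiplies the context portion of the input tokens by five, which is the cost trade-off described for hierarchical chunking.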
The following table is a relative cost comparison for various chunking strategies.
Chunking Strategy | Standard | Semantic | Hierarchical |
Relative Inference Costs | Low | Medium | High |
Conclusion
In this post, we discussed various factors that could impact costs for your generative AI application. This is a rapidly evolving space, and costs for the components we mentioned could change in the future. Consider the costs in this post as a snapshot in time that is based on assumptions and is directionally accurate. If you have any questions, reach out to your AWS account team.
In Part 2, we discuss how to calculate business value and the factors that impact business value.
About the Authors
Vinnie Saini is a Senior Generative AI Specialist Solution Architect at Amazon Web Services (AWS) based in Toronto, Canada. With a background in Machine Learning, she has over 15 years of experience designing and building transformational cloud-based solutions for customers across industries. Her focus has been primarily on scaling AI/ML-based solutions for unparalleled business impact, customized to business needs.
Chandra Reddy is a Senior Manager of the Solution Architects team at Amazon Web Services (AWS) in Austin, Texas. He and his team help enterprise customers in North America with their AI/ML and generative AI use cases in AWS. He has more than 20 years of experience in software engineering, product management, product marketing, business development, and solution architecture.