Resiliency plays a critical role in the development of any workload, and generative AI workloads are no different. There are unique considerations when engineering generative AI workloads through a resiliency lens. Understanding and prioritizing resilience is crucial for generative AI workloads to meet organizational availability and business continuity requirements. In this post, we look at the different stacks of a generative AI workload and what the resiliency considerations are for each.
Full stack generative AI
Although much of the excitement around generative AI focuses on the models, a complete solution involves people, skills, and tools from multiple domains. Consider the following image, which is an AWS view of the a16z emerging application stack for large language models (LLMs).
Compared to a more traditional solution built around AI and machine learning (ML), a generative AI solution now involves the following:
- New roles – You should account for model tuners as well as model builders and integrators.
- New tools – The traditional MLOps stack does not extend to cover the type of experiment tracking or observability needed for prompt engineering or agents that invoke tools to interact with other systems.
Prompt engineering
Unlike traditional AI models, Retrieval Augmented Generation (RAG) enables more accurate and contextually relevant responses by integrating external knowledge sources. The following are some considerations when using RAG:
- Setting appropriate timeouts is important for the customer experience. Nothing says bad user experience like being in the middle of a chat and getting disconnected.
- Be sure to validate prompt input data and prompt input size against the character limits defined by your model.
- If you are performing prompt engineering, you should persist your prompts to a reliable data store. This protects your prompts in the event of accidental loss and supports your overall disaster recovery strategy (a minimal sketch covering both of these points follows this list).
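As a minimal sketch of both points, assuming a hypothetical DynamoDB table named prompt-library and a character limit you set to match your model's documented maximum:

```python
import hashlib
import boto3

MAX_PROMPT_CHARS = 20_000  # assumption: align with your model's documented input limit

dynamodb = boto3.resource("dynamodb")
prompt_table = dynamodb.Table("prompt-library")  # hypothetical table name


def validate_prompt(prompt: str) -> str:
    """Reject empty or oversized prompts before they reach the model."""
    if not prompt or not prompt.strip():
        raise ValueError("Prompt is empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"Prompt exceeds {MAX_PROMPT_CHARS} characters")
    return prompt


def persist_prompt(prompt: str, version: str) -> str:
    """Store a validated prompt so it can be recovered after accidental loss."""
    prompt_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    prompt_table.put_item(Item={"prompt_id": prompt_id, "version": version, "body": prompt})
    return prompt_id
```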
Data pipelines
In cases where you need to provide contextual data to the base model using the RAG pattern, you need a data pipeline that ingests the source data, converts it to embedding vectors, and stores those vectors in a vector database. This pipeline could be a batch pipeline if you prepare contextual data in advance, or a low-latency pipeline if you are incorporating new contextual data on the fly. In the batch case, there are a couple of challenges compared to typical data pipelines.
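The following sketch shows the shape of such a batch pipeline; `embed` and `vector_store` are placeholders for your embedding model call and vector database client, and the chunk size is an assumption you would tune:

```python
from typing import Iterable


def chunk(text: str, size: int = 1000) -> Iterable[str]:
    """Split a source document into roughly fixed-size pieces for embedding."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def ingest(documents: Iterable[str], embed, vector_store) -> None:
    """Batch ingestion: chunk source text, embed each chunk, store the vectors.

    `embed` and `vector_store` are placeholders for your embedding model call
    and vector database client; both have finite capacity, as discussed below.
    """
    for doc in documents:
        for piece in chunk(doc):
            vector = embed(piece)               # call the embedding model
            vector_store.upsert(vector, piece)  # store the vector with its source text
```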
Data sources can be PDF documents in a file system, data from a software-as-a-service (SaaS) system such as a CRM tool, or data from an existing wiki or knowledge base. Ingesting these sources is different from ingesting typical data sources, such as log data in an Amazon Simple Storage Service (Amazon S3) bucket or structured data in a relational database. The level of parallelism you can achieve may be limited by the source system, so you need to be aware of throttling and use backoff techniques. Some of the source systems may be fragile, so you need to build in error handling and retry logic.
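For example, a fragile or throttled source can be wrapped with exponential backoff and jitter; `fetch_page` below is a placeholder for one call to the source system, such as one page of a SaaS API:

```python
import random
import time


def fetch_with_retries(fetch_page, max_attempts: int = 5):
    """Call a fragile source system with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page()
        except Exception:
            if attempt == max_attempts:
                raise
            # Back off with jitter so retries do not overwhelm the source system.
            time.sleep(min(2 ** attempt, 30) + random.random())
```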
The embedding model could be a performance bottleneck, regardless of whether you run it locally in the pipeline or call an external model. Embedding models are foundation models that run on GPUs and do not have unlimited capacity. If the model runs locally, you need to assign work based on GPU capacity. If the model runs externally, you need to make sure you are not saturating the external model. In either case, the level of parallelism you can achieve is dictated by the embedding model rather than by the CPU and RAM available in the batch processing system.
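One way to respect that limit is to cap parallelism explicitly instead of inheriting it from the batch hosts; the worker count below is an assumption sized to the embedding model, not to CPU count:

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: concurrency is sized to the embedding model's capacity
# (for example, GPU count or the provider's rate limit), not to CPU or RAM.
EMBEDDING_CONCURRENCY = 4


def embed_all(chunks, embed):
    """Embed text chunks with a fixed worker pool so the embedding model is not saturated."""
    with ThreadPoolExecutor(max_workers=EMBEDDING_CONCURRENCY) as pool:
        return list(pool.map(embed, chunks))
```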
In the low-latency case, you need to account for the time it takes to generate the embedding vectors. The calling application should invoke the pipeline asynchronously.
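One common way to decouple the caller is to queue the work, for example with an Amazon SQS queue; the queue URL below is a placeholder:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/embedding-requests"  # placeholder


def submit_for_embedding(document_id: str, text: str) -> None:
    """Queue the work and return immediately; a separate consumer generates the vectors."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"document_id": document_id, "text": text}),
    )
```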
Vector databases
A vector database has two functions: storing embedding vectors and running a similarity search to find the closest k matches to a new vector. There are three general types of vector databases:
- Dedicated SaaS options like Pinecone.
- Vector database functions integrated into other services. This includes native AWS services such as Amazon OpenSearch Service and Amazon Aurora.
- In-memory options that can be used for transient data in low latency scenarios.
We do not cover similarity search capabilities in detail in this post. Although important, they are a functional aspect of the system and do not directly affect resilience. Instead, we focus on the resilience aspects of a vector database as a storage system:
- Latency – Can the vector database perform well against a high or unpredictable load? If not, the calling application must handle rate limiting, backoff, and retry.
- Scalability – How many vectors can the system hold? If you exceed the capacity of the vector database, you will need to look at sharding or other solutions.
- High availability and disaster recovery – Embedding vectors are valuable data, and recreating them can be costly. Is your vector database highly available in a single AWS Region? Can you replicate the data to another Region for disaster recovery purposes?
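As one illustration of in-Region availability, the following sketch creates an OpenSearch k-NN index with replica shards, assuming an Amazon OpenSearch Service domain as the vector store (authentication is omitted, and the domain endpoint, index name, and vector dimension are placeholders). Replica shards protect against node or Availability Zone loss; they are not a substitute for a cross-Region copy of the source data.

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

# Assumption: an Amazon OpenSearch Service domain used as the vector store.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # placeholder
    use_ssl=True,
)

# Replica shards keep the index available if a node or Availability Zone is lost.
client.indices.create(
    index="document-embeddings",  # placeholder index name
    body={
        "settings": {"index.knn": True, "number_of_shards": 2, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 1536},  # match your model
                "text": {"type": "text"},
            }
        },
    },
)
```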
Application level
There are three unique application-level considerations when building generative AI solutions:
- Potentially high latency – Foundation models often run on large GPU instances and may have finite capacity. Be sure to use best practices for rate limiting, backoff and retry, and load shedding. Use asynchronous designs so that high latency does not interfere with the application's main interface (see the sketch after this list).
- Security posture – If you use agents, tools, plugins, or other methods of connecting a model to other systems, pay extra attention to your security posture. Models may try to interact with these systems in unexpected ways. Follow the normal practice of least-privilege access, for example restricting incoming prompts from other systems.
- Rapidly evolving frameworks – Open source frameworks like LangChain are evolving rapidly. Use a microservices approach to isolate other components from these less mature frameworks.
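The following sketch combines two of those practices: a semaphore sheds load beyond a fixed number of in-flight model calls, and failed calls are retried with backoff and jitter. `invoke_model` is a placeholder for your foundation model call, and the limits are assumptions to tune.

```python
import random
import threading
import time

MAX_IN_FLIGHT = 8       # assumption: shed load beyond this many concurrent model calls
REQUEST_ATTEMPTS = 3
_in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)


class Overloaded(Exception):
    """Raised when the application sheds load instead of queueing more model calls."""


def generate(prompt: str, invoke_model) -> str:
    # Load shedding: refuse new work instead of letting latency build up.
    if not _in_flight.acquire(blocking=False):
        raise Overloaded("Too many concurrent model calls; try again later")
    try:
        for attempt in range(1, REQUEST_ATTEMPTS + 1):
            try:
                return invoke_model(prompt)  # placeholder for the foundation model call
            except Exception:
                if attempt == REQUEST_ATTEMPTS:
                    raise
                # Backoff with jitter gives a busy endpoint time to recover.
                time.sleep(min(2 ** attempt, 20) + random.random())
    finally:
        _in_flight.release()
```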
Capacity
We can think about capacity in two contexts: inference and training model data pipelines. Capacity is a consideration when organizations are building their own pipelines. CPU and memory are two of the biggest requirements when choosing instances to run your workloads.
Instances that can support generative AI workloads can be harder to obtain than the average general-purpose instance type. Instance flexibility can help with capacity and capacity planning. Depending on the AWS Region you are running your workload in, different instance types are available.
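If you want to verify where a particular accelerated instance type is offered before committing to a Region, a quick check with the EC2 API looks like the following (the instance type is just an example):

```python
import boto3


def regions_offering(instance_type: str) -> list[str]:
    """Return the Regions where the given instance type is offered."""
    available = []
    ec2 = boto3.client("ec2")
    for region in [r["RegionName"] for r in ec2.describe_regions()["Regions"]]:
        regional = boto3.client("ec2", region_name=region)
        offerings = regional.describe_instance_type_offerings(
            LocationType="region",
            Filters=[{"Name": "instance-type", "Values": [instance_type]}],
        )
        if offerings["InstanceTypeOfferings"]:
            available.append(region)
    return available


print(regions_offering("p4d.24xlarge"))  # example instance type
```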
For user journeys that are critical, organizations should consider reserving or pre-provisioning instance types to ensure availability when needed. This pattern achieves a statically stable architecture, which is a resiliency best practice. To learn more about static stability in the reliability pillar of the AWS Well-Architected Framework, refer to Use static stability to prevent bimodal behavior.
Observability
In addition to the resource metrics you typically collect, such as CPU and RAM utilization, you should closely monitor GPU utilization if you host a model in Amazon SageMaker or Amazon Elastic Compute Cloud (Amazon EC2). GPU utilization can change unexpectedly if the base model or input data changes, and running out of GPU memory can put the system in an unstable state.
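On a self-managed EC2 host, one way to get that visibility is to publish custom CloudWatch metrics from nvidia-smi, as in the sketch below; SageMaker endpoints already emit GPU metrics without extra work. The namespace is a placeholder.

```python
import subprocess
import boto3

cloudwatch = boto3.client("cloudwatch")


def publish_gpu_metrics(namespace: str = "GenAI/Inference") -> None:
    """Read GPU utilization and memory from nvidia-smi and publish custom CloudWatch metrics."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    for gpu_index, line in enumerate(output.strip().splitlines()):
        utilization, memory_used = (float(v) for v in line.split(","))
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[
                {"MetricName": "GPUUtilization", "Value": utilization, "Unit": "Percent",
                 "Dimensions": [{"Name": "GpuIndex", "Value": str(gpu_index)}]},
                {"MetricName": "GPUMemoryUsedMiB", "Value": memory_used, "Unit": "Megabytes",
                 "Dimensions": [{"Name": "GpuIndex", "Value": str(gpu_index)}]},
            ],
        )
```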
Further up the stack, you'll also want to trace the flow of calls through the system, capturing interactions between agents and tools. Because the interface between agents and tools is defined less formally than an API contract, you should monitor these traces not only to verify performance but also to capture new error scenarios. To monitor your model or agent for security risks and threats, you can use tools like Amazon GuardDuty.
You should also capture baselines of embedding vectors, prompts, context, and output, and the interactions between these. If these change over time, it may indicate that users are using the system in new ways, that the reference data does not cover the question space in the same way, or that the model's output is suddenly different.
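A simple baseline comparison might track the centroid of recent embedding vectors against a reference set; the cosine-distance sketch below is one illustrative approach, not a complete drift detector:

```python
import numpy as np


def centroid(vectors: np.ndarray) -> np.ndarray:
    """Mean embedding vector, used as a simple baseline summary."""
    return vectors.mean(axis=0)


def drift_score(baseline_vectors: np.ndarray, recent_vectors: np.ndarray) -> float:
    """Cosine distance between the baseline centroid and the recent centroid.

    A score that creeps up over time suggests prompts or retrieved context are
    shifting away from what the system was evaluated against.
    """
    a, b = centroid(baseline_vectors), centroid(recent_vectors)
    cosine_similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cosine_similarity
```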
Disaster recovery
Having a business continuity plan with a disaster recovery strategy is a must for any workload. Generative AI workloads are no different. Understanding the failure modes that apply to your workload will help guide your strategy. If you are using AWS managed services for your workload, such as Amazon Bedrock and SageMaker, make sure the service is available in your recovery AWS Region. As of this writing, these AWS services do not natively support data replication across AWS Regions, so you need to think about your data management strategies for disaster recovery, and you may also need to fine-tune in multiple AWS Regions.
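For example, if you stage prompts, source documents, and exported embeddings in Amazon S3, you can keep a copy in your recovery Region. The sketch below backfills existing objects with an explicit copy (bucket names are placeholders); S3 Cross-Region Replication can handle new objects automatically.

```python
import boto3

# Assumption: prompts, source documents, and exported embeddings are staged in S3.
SOURCE_BUCKET = "genai-assets-us-east-1"    # placeholder
RECOVERY_BUCKET = "genai-assets-us-west-2"  # placeholder, in the recovery Region

s3 = boto3.client("s3")


def copy_to_recovery_region(key: str) -> None:
    """Copy one object to the bucket in the recovery Region."""
    s3.copy_object(
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        Bucket=RECOVERY_BUCKET,
        Key=key,
    )
```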
Conclusion
This post outlined how to take resilience into account when building generative AI solutions. Although generative AI applications have some interesting nuances, existing resilience patterns and best practices still apply. It's just a matter of evaluating each part of a generative AI application and applying the relevant best practices.
To learn more about generative AI and using it with AWS services, see the following resources:
About the authors
Jennifer Moran is an AWS Senior Resiliency Specialist Solutions Architect based in New York City. She has a diverse background, having worked across many technical disciplines, including software development, agile leadership, and DevOps, and is an advocate for women in tech. She enjoys helping customers design resilient solutions to improve their resilience posture and speaks publicly on all topics related to resilience.
Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held various positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He actively works on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.