ChatGPT and large language models (LLMs) are extremely flexible, allowing for the creation of numerous programs. However, the costs associated with LLM API calls might become significant when the application gains popularity and experiences increased traffic levels. When processing many queries, LLM services may also have lengthy wait periods.
To meet this difficulty head-on, researchers have developed GPTCache, a project aimed at creating a semantic cache for storing LLM answers. An open-source GPTCache program can make LLMs faster by caching their output answers. When the response has been requested before and is already stored in a cache, this can drastically cut down on the time it takes to obtain it.
GPTCache is flexible and simple, making it ideal for any application. It’s compatible with many language learning machines (LLMs), such as OpenAI’s ChatGPT.
How does it work?
To function, GPTCache caches the LLM’s final replies. The cache is a memory buffer used to retrieve recently used information quickly. GPTCache initially looks in the cache to determine if the requested response is already stored there whenever a new request is made to the LLM. If the answer can be found in the cache, it will be returned immediately. The LLM will generate the response and add it to the cache if not already there.
GPTCache’s modular architecture makes it simple to implement bespoke semantic caching solutions. Users can tailor their experience with each module by selecting various settings.
The LLM Adapter unifies the APIs and request protocols used by various LLM models by standardizing them on the OpenAI API. Since the LLM Adapter may move between LLM models without requiring a rewrite of the code or familiarity with a new API, it simplifies testing and experimentation.
The Embedding Generator creates embeddings using the requested model to carry out a similarity search. The OpenAI embedding API can be used with the supported models. This is ONNX using the GPTCache/paraphrase-albert-onnx model, the Hugging Face embedding API, the Cohere embedding API, the fastText embedding API, and the SentenceTransformers embedding API.
In Cache Storage, responses from LLMs like ChatGPT are kept until they can be retrieved. When determining whether or not two entities are semantically similar, cached replies are fetched and sent back to the requesting party. GPTCache is compatible with many different database management systems. Users can pick the database that best meets their requirements regarding performance, scalability, and cost of the most commonly supported databases.
Choices for Vector Store: GPTCache includes a Vector Store module, which uses embeddings derived from the original request to identify the K most similar requests. This feature can be used to determine how similar two requests are. In addition, GPTCache supports multiple vector stores, such as Milvus, Zilliz Cloud, and FAISS, and presents a straightforward interface for working with them. Users are provided with a variety of vector store options, any of which may affect GPTCache’s similarity search performance. With its support for various vector stores, GPTCache promises to be adaptable and meet the needs of a wider variety of use cases.
The GPTCache Cache Manager manages the eviction policies for the Cache Storage and Vector Store components. To create room for new data, a replacement policy decides which old data should be removed from the cache when it fills up.
The information for the Similarity Evaluator comes from both the Cache Storage and the Vector Store sections of GPTCache. It compares the input request to requests in the Vector Store using several different approaches. Whether or not a request is served from the cache depends on the degree of similarity. GPTCache offers a unified interface to similar methods and a library of available implementations. GPTCache’s ability to determine cache matches using a variety of similarity algorithms allows it to become adaptable to a large range of use cases and user requirements.
Features and Benefits
- Enhanced responsiveness and speed thanks to a decrease in LLM query latency made possible by GPTCache.
- Cost savings – many thanks to the token- and request-based pricing structure common to many LLM services. GPTCache can cut down on the cost of the service by limiting the number of times the API must be called.
- Increased scalability thanks to GPTCache’s capacity to offload work from the LLM service. As the number of requests you receive grows, this can help you continue to operate at peak efficiency.
- Costs associated with creating an LLM application can be kept to a minimum with the aid of GPTCache. Caching data generated by or mocked up in LLM allows you to test your app without making API requests to the LLM service.
GPTCache can be used in tandem with your chosen application, LLM (ChatGPT), cache store (SQLite, PostgreSQL, MySQL, MariaDB, SQL Server, or Oracle), and vector store (FAISS, Milvus, Ziliz Cloud). The goal of the GPTCache project is to make the most efficient use of language models in GPT-based applications by reusing previously generated replies whenever possible rather than starting from blank each time.
Check out the GitHub and Documentation. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone’s life easy.