Introduction
In recent years, Large Language Models (LLMs) have emerged as a game-changing technology that has revolutionized the way we interact with machines. These models, exemplified by OpenAI’s GPT series with versions such as GPT-3.5 and GPT-4, take a sequence of input text and generate coherent, contextually relevant, and human-sounding text in reply. Their applications are therefore wide-ranging, covering fields such as customer service, content creation, language translation, and code generation. At the core of these capabilities are advanced machine-learning and statistical techniques, including attention mechanisms for improving natural language understanding, transfer learning to provide foundational models at scale, data augmentation, and Reinforcement Learning From Human Feedback, which allow these systems to keep refining their training and to deliver increasingly better responses at inference time.
As a subset of artificial intelligence, machine learning is responsible for processing datasets to identify patterns and develop models that accurately represent the data’s nature. This approach generates valuable knowledge and unlocks a wide variety of tasks, such as content generation, which underpins the field of Generative AI that drives large language models. It is worth highlighting that this field is not solely focused on natural language, but on any type of content that can be generated: audio, with models capable of producing sounds, voices, or music; video, through recent models like OpenAI’s Sora; or images, including editing and style transfer driven by text prompts. The latter data format is especially valuable, since through multimodal integration and image/text embedding technologies it effectively illustrates how much knowledge can be represented through natural language.
Nevertheless, creating and maintaining models to perform this kind of operation, particularly at a large scale, is not an easy job. One of the main reasons is data, since it makes the greatest contribution to a well-functioning model. Training a model with a structurally sound architecture and high-quality data will produce valuable results; conversely, if the provided data is poor, the model will produce misleading outputs. Therefore, a dataset should contain a volume of data appropriate to the particular model architecture. This requirement complicates data treatment and quality verification, in addition to the potential legal and privacy issues that must be considered if the data is collected by automation or scraping.
Another reason lies in hardware. Modern deployed models, which need to process vast amounts of data from many users simultaneously, are not only large in size but also require substantial computing resources for performing inference tasks and providing quality service to their clients. That is reflected in equally significant costs in economic terms. On the one hand, setting up servers and data centers with the right hardware, considering that to provide a reliable service you need GPUs, TPUs, DPUs, and carefully selected components to maximize efficiency, is incredibly expensive. On the other hand, its maintenance requires skilled human resources — qualified people to solve potential issues and perform system upgrades as needed.
Objective
There are many other issues surrounding the construction of this kind of model and its large-scale deployment. Altogether, it is difficult to build a system with a supporting infrastructure robust enough to match leading services on the market like ChatGPT. Still, we can achieve reasonable approximations of the reference service thanks to the wide range of open-source content and technologies available in the public domain. Moreover, given how mature some of them are, they prove remarkably simple to use, allowing us to benefit from their abstraction, modularity, ease of integration, and other valuable qualities that enhance the development process.
Therefore, the purpose of this article is to show how we can design, implement, and deploy a computing system for supporting a ChatGPT-like service. Although the eventual result may not match the capabilities of the reference service, using high-quality dependencies and development tools, as well as a good architectural design, ensures the system can be scaled up to the desired computing power according to the user’s needs. That is, the system will be prepared to run on very few machines, possibly as few as one, with very limited resources, delivering a throughput consistent with those resources, or on larger computer networks with the appropriate hardware, offering an extended service.
Initially, the primary system functionality will be to allow a client to submit a text query, which is processed by an LLM model and then returned to the source client, all within a reasonable timeframe and with a fair quality of service. This is the highest-level description of our system, specifically of the application functionality it provides, since all implementation details such as communication protocols between components or the data structures involved are intentionally omitted. Now that we have a clear objective to reach, we can begin a decomposition that gradually increases the detail involved in solving the problem, often referred to as Functional Decomposition. Thus, starting from a black-box system (abstraction) that receives and returns queries, we can begin to comprehensively define how a client interacts with the system, together with the technologies that will enable this interaction.
At first, we must determine what constitutes a client, in particular, what tools or interfaces the user will require to interact with the system. As illustrated above, we assume that the system is a fully implemented and operational functional unit, allowing us to focus on clients and client-system connections. For the client, the interface will be available via a website, designed for versatility but primarily aimed at desktop devices. A mobile app could also be developed and integrated to use the same system services with its own interface, but from an abstract point of view it is desirable to unify all types of clients into one, namely the web client.
Subsequently, it is necessary to find a way to connect a client with the system so that an exchange of information, in this case queries, can occur between them. At this point, it is worth noting that the web client will rely on a specific technology such as JavaScript, with all the communication implications this entails. For other types of platforms, that technology will likely change, for example to Java for mobile clients or C/C++ for IoT devices, and compatibility requirements may demand that the system adapt accordingly.
One way to establish communication would be to use Sockets and similar tools at a lower level, allowing exhaustive control of the whole protocol. However, this option would require meeting the compatibility constraints described above with all client technologies, as the system will need to be able to collect queries from all available client types. Additionally, having exhaustive control implies a lengthier and potentially far more complex development, since many extra details must be taken into account, which significantly increases the number of lines of code and complicates both its maintainability and extensibility.
Given the above, the best alternative is to build an Application Programming Interface (API) that intermediates between the clients and the part of the system in charge of computing, i.e., the one that solves queries. The main advantage of using an API is that all the internal connection handling, such as opening and closing sockets, thread pooling, and other important details like data serialization, is performed by the framework on which the API is built. In this way, we ensure that the client only has to send its query to the server where the API is running and wait for its response, relying on dependencies that simplify the management of these API requests. Another benefit derived from the previous point is the ease of extending the service by modifying the API endpoints. For example, if we want to add a new model to the system or any other functionality, it is enough to add and implement a new endpoint, without having to change the communication protocol itself or the way a client interacts with the system.
Compute Service
Once we set up a mechanism for clients to communicate elegantly with the system, we must address the problem of how to process incoming queries and return them to their corresponding clients in a reasonable amount of time. But first, it is relevant to point out that when a query arrives at the system, it must be redirected to a machine that has an LLM loaded in memory with its respective inference pipeline, and the query must be run through that pipeline to obtain the result text (the LLM answer) that will later be returned. Consequently, the inference process cannot be distributed among several machines for resolving a single query. With that in mind, we can begin the design of the infrastructure that will support the inference process.
In the previous image, the compute service was represented as a single unit. If we consider it as a single machine connected to the API server, this time through a single Socket channel, we will be able to redirect all the API queries to that machine, concentrating the entire system load in a single place. As you can imagine, this would be a good choice for a home system that only a few people will use. However, in this case, we need a way to make this approach scalable, so that with an increase in computing resources we can serve as many additional users as possible. But first, we must segment the previously mentioned computational resources into units. In this way, we will have a global view of their interconnection and will be able to optimize our project’s throughput by changing their structure or how they are composed.
A computational unit, which from now on we will call a node for the convenience of its implementation, will consist of a physical machine that receives requests (though not all of them) needing to be solved. Additionally, we can consider a node as a virtualization of a (possibly small) group of machines, with the purpose of increasing the total throughput per node by introducing local parallelism. Regarding the hardware employed, it will depend to a large extent on how the service is oriented and how far we want to go. Nevertheless, for the version presented in this case, we will assume a standard CPU, a generous amount of RAM to avoid problems when loading the model or forwarding queries, and dedicated processors such as GPUs, with the possibility of including TPUs in some specific cases.
Now, we can establish a network that links multiple nodes in such a way that, via one of them connected to the API server, queries can be distributed throughout the network, optimally leveraging all the system’s resources. Above, we can see how all the nodes are structurally connected in a tree-like shape, with its root responsible for collecting API queries and forwarding them accordingly. How they should be interconnected depends considerably on the exact purpose of the system. In this case, a tree is chosen for the simplicity of the distribution primitives. For example, if we wanted to maximize the number of queries transmitted between the API and the nodes, there would have to be several connections from the API to the roots of several trees, or another data structure entirely if desired.
Lastly, we need to define how a query is forwarded and processed when it reaches the root node. As before, there are many available and equally valid alternatives. However, the algorithm we will follow will also serve to understand why a tree structure is chosen to connect the system nodes.
Since a query must be solved on a single node, the goal of the distribution algorithm will be to find an idle node in the system and assign it the input query for resolution. As can be seen above, if we consider an ordered sequence of queries numbered in natural order (1-indexed), each number corresponds to the edge connected to the node assigned to solve that query. To understand the numbering in this concrete example, we can assume that queries take an unbounded time to solve, so each node progressively becomes busy, which makes the heuristic behind the algorithm easier to follow.
In short, the root will not perform any resolution processing, reserving all its capacity for exchanging requests with the API. For any other node, when it receives a query from a hierarchically superior node, the first step is to check whether it is performing any computation for a previous query; if it is idle, it resolves the query, and otherwise it forwards it by Round Robin to one of its descendant nodes. With Round Robin, each incoming query is redirected to a different descendant, traversing the entire descendant list as if it were a circular buffer. This means the local load of a node can be evenly distributed downwards, efficiently leveraging the resources of each node and preserving our ability to scale the system by adding more descendants.
Finally, if the system is serving many users and a query arrives at a leaf node that is also busy, there are no descendants left to forward it to. Therefore, every node has a query queuing mechanism where requests wait in these situations, which also makes it possible to apply batch operations to queued queries to accelerate LLM inference. Additionally, when a query is completed, to avoid overloading the system by passing it back up level by level until it reaches the top of the tree, it is sent directly to the root, subsequently reaching the API and the client. We could connect all nodes to the API, or implement other alternatives; however, to keep the code as simple and the system as performant as possible, all answers are sent to the root.
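To make the heuristic concrete, here is a minimal sketch of that distribution logic written in Python purely for illustration; the actual nodes are implemented in Java later in the article, and the class and field names below are assumptions.

```python
# Illustrative sketch of the distribution heuristic; the real nodes are written in Java.
from collections import deque

class SketchNode:
    def __init__(self, children=None, is_root=False):
        self.children = children or []   # descendant nodes
        self.is_root = is_root
        self.busy = False                # True while solving a query
        self.queue = deque()             # where a busy leaf parks queries (batchable)
        self.next_child = 0              # round-robin cursor over the descendants

    def receive(self, query):
        if not self.is_root and not self.busy:
            self.solve(query)                        # idle non-root node: solve locally
        elif self.children:
            child = self.children[self.next_child]   # forward by Round Robin
            self.next_child = (self.next_child + 1) % len(self.children)
            child.receive(query)
        else:
            self.queue.append(query)                 # busy leaf: queue until resources free up

    def solve(self, query):
        self.busy = True
        # ... run LLM inference here, then send the answer directly to the root ...
```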
After having defined the complete system architecture and how it will perform its task, we can begin to build the web client that users will need when interacting with our solution.
As expected, the web client is implemented in basic HTML, CSS, and JavaScript, all embedded in a single .html file for convenience. This file is served by the API each time the client makes the request corresponding to the application startup; that is, when the user opens the browser and enters the address where the API hosts the entry point, the API returns the .html file to be rendered in the browser.
Subsequently, when the user wishes to send a text query to the system, JavaScript submits an HTTP request to the API with the corresponding details such as the data type, endpoint, and CSRF security token. Using AJAX in this process makes it simple to define a callback that executes when the API responds to the request, in charge of displaying the result on the screen. Additionally, it is worth mentioning that the messages sent are not the raw written or returned text; they are wrapped in a JSON object with other important parameters, such as the timestamp, which offers the possibility of adding extra fields on the fly to manage the synchronization of some system components.
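As a reference, the envelope the API receives might look roughly like the following; only the text and timestamp fields are described above, so anything beyond them would be an assumption.

```python
import json, time

# Hypothetical query envelope; the article only specifies the text and a timestamp.
message = {
    "text": "Hello, how are you?",   # what the user typed
    "timestamp": time.time(),        # useful later for synchronization/bookkeeping
}
payload = json.dumps(message)        # body of the HTTP request sent by the web client
parsed = json.loads(payload)         # what the API endpoint reads on arrival
```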
When the web client is ready, we can proceed to implement the API which will provide the necessary service.
There are many technologies available to build an API, but in this project we will specifically use Django, a Python framework, on a dedicated server. This decision is motivated by the high scalability and ease of integration with other Python dependencies offered by this framework, in addition to other useful properties such as security and the default administration panel.
One of the endpoints to configure is the entry point for the web client, represented by the root URL /. Thus, when a user accesses the server through a default HTTP request like the one shown above, the API will return the HTML code required to display the interface and start making requests to the LLM service.
At the same time, the API has to support the client’s requests once it has loaded the interface. Because these have to be managed in a special way, they will have their own endpoint, “/arranca”, to which the query data will be sent in the corresponding JSON format, and from which the API returns the solved query after processing it with the node tree. In this endpoint, the server uses a previously established Socket channel with the root node of the hierarchy to forward the query, waiting for its response through a synchronization mechanism.
Concerning the code, the urls.py file stores the associations between URLs and endpoints, so that the default empty URL is mapped to the function that reads the .html file from the templates folder and sends it back, while the /arranca URL is mapped to the function that solves a query. In addition, a function in views.py is executed to launch the main server thread. Meanwhile, in settings.py, the only things to change are setting the DEBUG parameter to False and listing the hosts allowed to connect to the server (the ALLOWED_HOSTS setting).
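A minimal sketch of that urls.py could look as follows, assuming the two view functions are named index() and arranca() as described.

```python
# urls.py: minimal sketch; view names follow the description above
from django.urls import path
from . import views

urlpatterns = [
    path("", views.index, name="index"),              # default URL: returns the .html interface
    path("arranca/", views.arranca, name="arranca"),  # endpoint that receives client queries
]
```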
Finally, there is the views.py script, where all the API functionality is implemented. First, we have a main thread in charge of receiving and handling incoming connections (from the root node). Initially, this connection will be permanent for the whole system’s lifetime. However, it is placed inside an infinite loop in case it is interrupted and has to be reestablished. Secondly, the default endpoint is implemented with the index() function, which returns the .html content to the client if it performs a GET request. Additionally, the queries the user submits in the application are transferred to the API through the /arranca endpoint, implemented in the function with the same name. There, the input query is forwarded to the root node, blocking until a response is received from it and returned to the client.
This blocking is achieved through locks and a synchronization mechanism in which each query has a unique identifier, inserted by the arranca() function as a field in the JSON message named request_id. Essentially, it is a natural number corresponding to the query’s arrival order. Therefore, when the root node sends a solved query back to the API, it is possible to know which blocked execution generated it: that execution is unblocked and returns its result, while the rest remain waiting.
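The following is a simplified sketch of how that synchronization could be expressed, using one threading.Event per request_id; it is only an approximation of the mechanism described above, and the helper that writes to the root-node socket is a placeholder.

```python
# views.py: simplified sketch of the request_id synchronization described above
import itertools
import json
import threading

from django.http import JsonResponse
from django.shortcuts import render

_counter = itertools.count(1)   # natural numbers in arrival order
_pending = {}                   # request_id -> (Event, result holder)
_lock = threading.Lock()

def send_to_root_node(query):
    ...  # placeholder: write the JSON over the established Socket channel to the root node

def index(request):
    return render(request, "index.html")    # serves the embedded web client

def arranca(request):
    query = json.loads(request.body)
    with _lock:
        request_id = next(_counter)
        event, result = threading.Event(), {}
        _pending[request_id] = (event, result)
    query["request_id"] = request_id
    send_to_root_node(query)
    event.wait()                             # block until the root returns this request_id
    return JsonResponse({"text": result["text"]})

def on_root_response(answer):
    # invoked by the main thread that listens on the root-node connection
    event, result = _pending.pop(answer["request_id"])
    result["text"] = answer["text"]
    event.set()                              # unblock only the matching arranca() call
```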
With the API operational, we will proceed to implement the node system in Java. The main reason for choosing this language is the technology that enables communication between nodes. To obtain the simplest possible communication semantics at this level, we will discard sockets and manually serialized messages and replace them with RMI, which on other platforms would be somewhat more complicated, although they also offer solutions such as Pyro4 in Python.
Remote Method Invocation (RMI) is a communication paradigm that enables the creation of distributed systems composed of remote objects hosted on separate machines, with the ability to obtain remote references to each other and to invoke remote methods within their service interface. Hence, due to the high degree of abstraction in Java, the query transfer between nodes will be implemented with a remote call to an object referenced by the sender node, leaving the complex process of API connection to be handled manually, as it was previously done in Python.
At the outset, we should define the remote interface that determines the remote invocable methods for each node. On the one hand, we have methods that return relevant information for debugging purposes (log() or getIP()). On the other hand, there are those in charge of obtaining remote references to other nodes and registering them into the local hierarchy as an ascending or descending node, using a name that we will assume unique for each node. Additionally, it has two other primitives intended to receive an incoming query from another node (receiveMessage()) and to send a solved query to the API (sendMessagePython()), only executed in the root node.
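The interface itself is written in Java, but since Pyro4 was mentioned earlier as the Python counterpart of RMI, a rough Pyro4 analogue of such a remote interface could look like the sketch below; it is purely illustrative and not the project’s actual code.

```python
import Pyro4

@Pyro4.expose
class NodeService:
    """Rough Pyro4 analogue of the Java remote node interface (illustrative only)."""

    def log(self):
        return "node status..."        # debugging information

    def get_ip(self):
        return "127.0.0.1"             # placeholder

    def connect_child(self, child_name):
        pass                           # register a descendant node by name

    def receive_message(self, query):
        pass                           # incoming query from another node

# Registration would go through a name server; the Java version uses an LDAP registry instead:
# daemon = Pyro4.Daemon()
# uri = daemon.register(NodeService())
# Pyro4.locateNS().register("node.root", uri)
# daemon.requestLoop()
```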
From the interface, we can implement its operations inside the node class, instantiated every time we start up the system and decide to add a new machine to the node tree. Among the major features included in the node class is the getRemoteNode() method, which obtains a remote reference to another node from its name. For this purpose, it accesses the name registry and executes the lookup() primitive, returning the remote reference in the form of an interface, if it is registered, or null otherwise.
Obtaining remote references is essential in the construction of the tree, in particular for the methods that connect a parent node to a descendant or obtain a reference to the root to send solved queries. One of them is connectParent(), invoked when a descendant node needs to connect with a parent node. As you can see, it first uses getRemoteNode() to retrieve the parent node and, once it has the reference, assigns it to a local variable of the node instance. Afterwards it calls connectChild(), which appends the invoking remote node to the parent’s descendant list. If the parent node does not exist, the method will try to call a function on a null object, raising an exception. Next, it should be noted that the methods that receive queries from the API (receiveMessagePython()) and from other nodes (receiveMessage()) are protected with the synchronized clause to avoid race conditions that may interfere with the system’s correct operation. These methods are also responsible for implementing the query distribution heuristic, which uses a local variable to determine the node to which an incoming query should be forwarded.
Lastly, the node class has a thread pool used to manage query resolution within the consultLLM() method. In this way, calls to it return immediately in the Java code, since the pool assigns a thread to the required computation and hands control back to the program so it can accept additional queries. This is also an advantage for detecting whether a node is performing any computation, since it is enough to check whether the number of active threads is greater than 0. The other use of threads in the node class, this time outside the pool, is in the connectServer() method, in charge of connecting the root node with the API for query exchange.
In the Utilities class, we only have the method that creates an LDAP usage context, with which we can register and look up remote references to nodes by their names. This method could be placed in the node class directly, but in case we need more methods like it, we keep it in the Utilities class to follow the helper-class pattern.
The creation of node instances, as well as their management, which is performed manually for each of them, is implemented in the Launcher class. It uses a command line interface for instructing the respective node, which is created at startup with a specific name registered in the designated LDAP server. Some of the commands are:
- log: Prints useful information to know the node status.
- parent: Connects the node to a specified parent from its name.
- registry: Lists all nodes currently registered in the LDAP directory under the organizational unit ou=Nodes. This could be useful for monitoring the registry server or creating new nodes.
- server: Connects the node to a server specified by its address and port number. Primarily, the server will be the Python API, although it could also serve other functionalities.
Since nodes are remote objects, they must have access to a registry that enables them to obtain remote references to other nodes by name. The solution provided by Java is to use rmiregistry to initialize a registry service on a machine. However, when protected operations such as rebind() are executed from another host, it throws a security exception, preventing a new node from registering on a machine other than the one containing the registry. For this reason, and also because of its simplicity, this project will use an Apache server as the registry, accessed through the Lightweight Directory Access Protocol (LDAP). This protocol makes it possible to store Name->Remote_Node pairs in a directory system, with additional functionalities that significantly improve the registry service compared with the one offered by the Java registry.
The advantages of using LDAP begin with its operational complexity, which at first glance might seem like a drawback but in reality is what enables the system to be adapted to various security and configuration needs at a much finer level of detail. On the one hand, the authentication and security features it offers allow any host to perform a protected operation, such as registering a new node, as long as that host is identified by the LDAP server. For example, when a context object is created to access the server and perform operations, there is the option of adding authentication data as parameters in the HashMap passed to its constructor. If the context is created, the data matches what the server expects; otherwise, it can be assumed that the connection is being made by an unauthenticated (“malicious”) host, ensuring that only system nodes can manipulate the server’s information. On the other hand, LDAP allows for much more efficient centralization of node registration and much more advanced interoperability, as well as easy integration of additional services like Kerberos.
To ensure a server can operate as a node registry, we have to apply a specific configuration to it. First, since the project will not be deployed in an environment with real (and potentially malicious) users, all authentication options are omitted to keep things simple and clean. Next, a Distinguished Name must be defined so that a node name can be associated with its corresponding remote object. In this case, assuming that we prevent the registration of several nodes with the same name, we simply have to store the node name in an attribute such as cn= (Common Name) within a given organizational unit, ou=Nodes. Therefore, the distinguished name will be of the form: cn=Node_Name,ou=Nodes
Whenever a new node is created, it is registered in the LDAP server using its distinguished name and its Node instance as a new directory entry. Likewise, deleting a node or getting its remote reference from the registry requires using the distinguished name as well. Performing these operations on the registry implies having an open connection to the LDAP server. But, since the nodes are written in Java, we can use services that abstract the whole connection process and let us focus on invoking the operations. The service to be used by the nodes will be a directory context, commonly defined by the DirContext interface. Thus, accessing the server and performing some management is as simple as creating an object that implements the DirContext interface, in this case InitialDirContext, and assigning it the appropriate parameters to identify the server: a URL of the form ldap://IP:port/, an identification of the protocol to be used, and even authentication parameters, which in this project will not be used.
Lookup, Bind, and Unbind
For simplicity, Launcher will have its own context object, and each node will also have its own. This allows Launcher to create entries and perform deletions, while each node can perform lookup operations to obtain remote references from node names. Deletion operations are the simplest, since they only require the distinguished name of the server entry corresponding to the node to be deleted. If it exists, it is deleted and the call to unbind() ends successfully; otherwise, it throws an exception. On the other hand, the lookup and register operations require following RFC 2713. In the case of appending a node to the server, the bind() primitive is used, whose arguments are the distinguished name of the entry in which that node will be hosted and its remote object. However, the bind function is not given the node object as is, nor its interface, since the object is not serializable and bind() cannot obtain an interface “instance” directly. As a workaround, the above RFC dictates that the node instance be wrapped in a MarshalledObject. Consequently, bind will receive a MarshalledObject composed of the node being registered within the server, instead of the original node instance.
Finally, the lookup operation is performed with the lookup() primitive over a context. If the name and node have not been previously registered or an unexpected error occurs in the process, an exception is thrown. Conversely, if the operation succeeds, it returns the MarshalledObject associated with the queried distinguished name. Since the remote reference returned by lookup() is contained in the MarshalledObject wrapper with which it was stored in the registry, the get() operation of the MarshalledObject must be used to obtain the usable remote reference. Additionally, this functionality makes it possible to prevent the registration of a node with the same name as one already registered: before executing bind(), a lookup() is performed to check whether the distinguished name already exists.
Concerning the inference process at each node, the node tree has an LLMProcess class in charge of launching a Python process to which queries are transferred and from which they return solved, since in Python we can easily manage the LLM and its inference pipeline.
When a new LLMProcess is instantiated, it is necessary to find an available port on the machine so the Java and Python processes can communicate. For simplicity, this data exchange will be accomplished with Sockets, so after finding an available port by opening and closing a ServerSocket, the llm.py process is launched with the port number as an argument. Its main functions are destroyProcess(), to kill the process when the system is stopped, and sendQuery(), which sends a query to llm.py and waits for its response, using a new connection for each query.
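That port-finding trick is done from Java in the project; expressed in Python for illustration, it boils down to something like this.

```python
# The project does this from Java with a ServerSocket; this is the same idea in Python.
import socket
import subprocess
import sys

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))                  # port 0: let the OS pick an available port
        return s.getsockname()[1]

port = find_free_port()
process = subprocess.Popen([sys.executable, "llm.py", str(port)])
# destroyProcess() would later call process.terminate() when the system stops
```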
Inside llm.py, there is a loop that continuously waits to accept an incoming connection from the Java process. When such a connection is established, it is handled by a ThreadPoolExecutor() thread via the handle_connection() function, which reads the input data from the channel, interprets it as JSON, and forwards the “text” field to the inference pipeline. Once the result is produced, it is sent back to the Java process (on the other side of the connection) and the function returns, releasing its thread back to the pool.
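A simplified sketch of what llm.py looks like, following the description above; the exact wire format, buffer sizes, and generation parameters are assumptions.

```python
# llm.py: simplified sketch; wire format and generation parameters are assumptions
import json
import socket
import sys
from concurrent.futures import ThreadPoolExecutor

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
pool = ThreadPoolExecutor(max_workers=4)

def handle_connection(conn):
    with conn:
        query = json.loads(conn.recv(65536).decode("utf-8"))
        answer = generator(query["text"], max_new_tokens=100)[0]["generated_text"]
        query["text"] = answer                       # reuse the envelope for the reply
        conn.sendall(json.dumps(query).encode("utf-8"))

def main(port):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("localhost", port))
        server.listen()
        while True:                                  # one connection per query from the Java side
            conn, _ = server.accept()
            pool.submit(handle_connection, conn)

if __name__ == "__main__":
    main(int(sys.argv[1]))
```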
Model Performance
As can be seen in the script, the pipeline instance allows us to select the LLM model that will be executed on the host node. This gives us access to all the models uploaded to the Hugging Face Hub, with very diverse options such as code generation, chat, or general response generation.
By default, we use the gpt2 model, which, with about 117M parameters and roughly 500 MB of weights, is the lightest and easiest option to integrate. Being such a small model, its answers are rather basic: resolving a query amounts to little more than predicting a continuation of the input text, for example:
User: Hello.
GPT: Hello in that the first thing I’d like to mention is that there…
There are other versions of gpt2, such as gpt2-large or gpt2-xl, all available on Hugging Face. The most powerful is gpt2-xl, with 1.5B parameters and about 6 GB of weights; it requires significantly more powerful hardware to run, but produces more coherent responses like:
User: Hello.
GPT: Hello everyone — thanks for bearing with me all these months! In the past year I’ve put together…..
Apart from the OpenAI GPT series, you can choose from many other available models, although most of them require an authentication token to be inserted in the script. For example, more recent models have been released that are optimized for the space they occupy and the time a query needs to traverse the entire inference pipeline. Llama 3 is one of them, with small versions of 8B parameters and large-scale versions of 70B.
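Switching the pipeline to one of these gated models is mostly a matter of changing the model identifier and providing a Hugging Face access token, roughly as in the example below; the model id and token handling are given only as an example, and older transformers versions use use_auth_token instead of token.

```python
from transformers import pipeline

# Example of selecting a gated model; requires accepting the model license on Hugging Face
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example id, 8B version
    token="hf_...",                               # your Hugging Face access token
)
```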
However, choosing a model for a system should not be based solely on its number of parameters, since its architecture also determines the amount of knowledge it can model. For this reason, you can find small models with performance very similar to that of large-scale models, i.e., they produce answers with a comparable level of language understanding while requiring fewer computing resources to generate them. As a guide, you can use benchmarks, also provided by Hugging Face itself, or specialized tests to measure these properties for any LLM.
The results of such tests, along with the average time it takes to respond on given hardware, are a fairly complete indicator for selecting a model. However, always keep in mind that the LLM must fit in the memory of the chip on which it runs. Thus, if we use GPU inference with CUDA, as in the llm.py script, the graphics memory must be larger than the model size. If it is not, you must distribute the computation over several GPUs, on the same machine or on more than one, depending on the capability you want to achieve.
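As a quick sanity check before loading a model with CUDA, you can compare the model size against the available graphics memory and fall back to sharding when it does not fit; this is a rough sketch, and exact overheads depend on the model and precision.

```python
import torch

# Rough check of available GPU memory before loading the model with CUDA
if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 memory: {total_gib:.1f} GiB")

# If one GPU is not enough, the accelerate library can shard the model across devices:
#   pipeline("text-generation", model="gpt2-xl", device_map="auto")
# With a single sufficiently large GPU, device=0 places the whole pipeline on it.
```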
Kotlin Mobile Client
Before we finish, we can see how a new type of client could be included in the system, demonstrating the extensibility of everything built so far. Since this project aims to be a distributed system, it is natural to expect it to be compatible with mobile devices, just as the official ChatGPT app supports Android and iOS. In our case, we can develop an app for native Android, although a much better option would be to adapt the system to a multiplatform Jetpack Compose project. This option remains a possibility for a future update.
The initial idea is to connect the mobile client to the API and use the same requests as the web one, with dependencies like HttpURLConnection. The code implementation isn’t difficult, and the documentation Android provides on its official page is also useful for this purpose. However, we can also emulate the functionality of the API with a custom Kotlin intermediate component, using ordinary TCP Android sockets for communication. Sockets are relatively easy to use, require a bit of effort to manage and to ensure everything works correctly, and provide a decent level of control over the code. To compensate for the absence of the mediating API, we can place a Kotlin node between the mobile client and the Java node tree, which manages the connection between the root node and the mobile client only, while the web clients and the API remain separate.
Regarding the interface, the application we are imitating, ChatGPT, has a very clean and modern look, and since the web version is already finished, we can try to copy it as closely as possible in the Android Studio editor.
When working with sockets, we have to make sure that the user is connected to the correct IP address and port of the server that will solve their queries. We can achieve this with a new initial interface that appears every time the application is opened. It’s a simple View with a button, a text field to enter the IP address, and a small text label that gives the user live status information, as you can see above.
Then, we need the interface to resemble a real chat, where new messages appear at the bottom and older ones move up. To achieve this, we can insert a RecyclerView, which will take up about 80% of the screen. The plan is to have a predefined message view that can be dynamically added to the list and styled based on whether the message comes from the user or the system.
Finally, the problem with Android connections is that you can’t perform any network-related operation on the main thread, since it throws a NetworkOnMainThreadException. At the same time, you can’t update UI components from outside the main thread, or you get a CalledFromWrongThreadException. We can deal with this by merging the connection view into the main one and, most importantly, by making good use of coroutines, which let us run network-related tasks off the main thread.
Now, if you run the system and enter a text query, the answer should appear a few seconds after sending it, just like in larger applications such as ChatGPT.
Despite having a functional system, you can make significant improvements depending on the technology used to implement it, both software and hardware. However, it can provide a decent service to a limited number of users, varying largely with the available resources. Finally, it should be noted that matching the performance of real systems like ChatGPT is complicated, since the model size and the hardware required to support it are particularly expensive. The system shown in this article is highly scalable for a small or even intermediate solution, but achieving a large-scale solution requires far more complex technology, possibly while leveraging some of this system’s structure.
Thanks to deivih84 for the collaboration in the Kotlin Mobile Client section.