The scale and complexity of LLMs
The incredible capabilities of LLMs are powered by their vast neural networks, which are composed of billions of parameters. These parameters are learned during training on large text corpora and are tuned to make the models as accurate and versatile as possible. This level of complexity requires significant computational power for processing and storage.
The accompanying bar chart outlines the number of parameters at different language model scales. As we move from smaller to larger models, we see a dramatic increase in parameter counts, from "small" language models with a modest few million parameters to "large" models with tens of billions.
However, it is GPT-3, with 175 billion parameters, that dwarfs the parameter counts of the other models. GPT-3 not only accounts for the largest bar on the graph, but also underpins the most recognizable generative AI application, ChatGPT. This imposing presence is representative of other LLMs in its class and illustrates both the requirements needed to power the AI chatbots of the future and the processing power needed to support such advanced AI systems.
The cost of running LLMs, and quantization
Deploying and operating complex models can be expensive due to the need for cloud computing or specialized hardware, such as high-end GPUs and AI accelerators, along with continuous energy consumption. Choosing an on-premise solution can reduce costs and adds flexibility in hardware options and the freedom to run the system anywhere, with a trade-off in maintenance and the need to hire trained professionals. High costs can make it difficult for small businesses to train and deploy advanced AI. This is where quantization comes in handy.
What is quantization?
Quantization is a technique that reduces the numerical precision of each parameter in a model, thus decreasing its memory footprint. It is similar to compressing a high-resolution image to a lower resolution: the essence and most important aspects are preserved, but at a reduced data size. This approach allows LLMs to be deployed on less powerful hardware without a substantial loss of performance.
ChatGPT was trained and is deployed using thousands of NVIDIA DGX systems, millions of dollars in hardware, and tens of thousands more for infrastructure. Quantization can allow for good proofs of concept, or even full implementations with less spectacular (but still high-performance) hardware.
In the following sections, we will discuss the concept of quantization, its methodologies, and its importance in bridging the gap between the highly resource-intensive nature of LLMs and the practical realities of everyday technology use. With it, the transformative power of LLMs can become a staple in smaller-scale applications, offering great benefits to a broader audience.
Basics of quantization
Quantizing a large language model refers to the process of reducing the precision of the numerical values used in the model. In the context of neural networks and deep learning models, including large language models, numerical values are usually represented as floating-point numbers with high precision (e.g., 32-bit or 16-bit floating-point formats). Read more about floating-point precision here.
Quantization addresses this problem by converting these high-precision floating-point numbers into lower-precision representations, such as 16-bit floats or 8-bit integers, making the model more memory efficient and faster during training and inference at the cost of some precision. As a result, model training and inference require less storage, consume less memory, and can run more quickly on hardware that supports lower-precision calculations.
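To make this concrete, below is a minimal NumPy sketch of 8-bit affine quantization, the basic scheme most integer quantization methods build on. It is an illustration rather than code from any particular framework: each float is mapped onto an integer grid via a scale and zero point, and dequantization recovers an approximation with some rounding error.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 using a scale and zero point (affine quantization)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)        # float step size per integer level
    zero_point = int(round(qmin - x.min() / scale))    # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: a tensor of fp32 "weights" shrinks from 4 bytes to 1 byte per value.
weights = np.random.randn(1024).astype(np.float32)
q, scale, zp = quantize_int8(weights)
recovered = dequantize_int8(q, scale, zp)
print("max absolute error:", np.abs(weights - recovered).max())
```

Per value, storage drops from 4 bytes (float32) to 1 byte (int8), at the cost of the small rounding error printed at the end.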
Types of quantization
To add depth to the topic, it is essential to understand that quantization can be applied at various stages of a model's development and deployment life cycle. Each method has distinct advantages and disadvantages and is selected based on the specific requirements and limitations of the use case.
1. Static quantization
Static quantization is a technique applied during the training phase of an AI model, where weights and activations are quantized to a lower bit precision and applied to all layers. Weights and activations are quantized in advance and remain fixed throughout. Static quantization works well when the memory requirements of the system on which you plan to deploy the model are known in advance.
- Advantages of static quantization
- Simplifies implementation planning since quantization parameters are fixed.
- Reduces the size of the model, making it more suitable for edge devices and real-time applications.
- Cons of static quantization
- Although performance drops are predictable, certain parts of the model may suffer more from a broad static approach that applies the same quantization everywhere.
- Limited adaptability to different input patterns and less robust to weight updates.
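As a concrete reference point, PyTorch's eager-mode static quantization fixes the scales of both weights and activations ahead of inference, after a short calibration pass with representative data. The sketch below uses a toy network and random calibration batches as stand-ins; it is a minimal illustration, not a production recipe.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # converts fp32 inputs to int8 at runtime
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = DeQuantStub()  # converts int8 outputs back to fp32

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = get_default_qconfig("fbgemm")   # x86 backend; "qnnpack" on ARM
prepared = prepare(model)                        # inserts observers to record activation ranges

# Calibration: run a few representative batches so the observers can fix the scales.
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(32, 128))

quantized = convert(prepared)                    # swaps modules for int8 versions
print(quantized)
```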
2. Dynamic quantization
Dynamic quantization involves quantizing weights statically, while activations are quantized on the fly during model inference. Weights are quantized in advance, and activations are quantized dynamically as data passes through the network. This means that certain parts of the model are executed at different precisions instead of adopting a single fixed quantization by default.
- Advantages of dynamic quantization
- Balances model compression and runtime efficiency without a significant drop in accuracy.
- Useful for models where activation precision is more critical than weight precision.
- Cons of dynamic quantization
- Performance gains are less predictable than with static methods (though this is not necessarily a bad thing).
- Computing quantization parameters dynamically adds computational overhead, making training and inference slower than the other methods, while still being lighter than running without quantization.
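In PyTorch, dynamic quantization is a one-call transformation: weights of the selected layer types are converted to int8 up front, and activations are quantized on the fly at inference time. A minimal sketch with a toy stand-in for a model:

```python
import torch
from torch.ao.quantization import quantize_dynamic

# Any module containing nn.Linear layers works; an LLM's transformer blocks are mostly Linear.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).eval()

# Weights of nn.Linear are converted to int8 ahead of time;
# activations are quantized dynamically during each forward pass.
dq_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    y = dq_model(x)
print(y.shape)
```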
3. Post-Training Quantization (PTQ)
In this technique, quantization is applied after the model has already been trained. It involves analyzing the distribution of weights and activations and then mapping these values to a lower bit depth. PTQ is commonly used to deploy models on devices with limited resources, such as edge devices and mobile phones. PTQ can be static or dynamic.
- Advantages of PTQ
- It can be applied directly to a previously trained model without the need to retrain.
- Reduces model size and decreases memory requirements.
- Improves inference speed, enabling faster calculations once deployed.
- Cons of PTQ
- Potential loss in model accuracy due to approximation of weights.
- Requires careful calibration and adjustment to mitigate quantization errors.
- It may not be optimal for all model types, especially those sensitive to weight accuracy.
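For LLMs specifically, a common practical form of PTQ is loading a pre-trained model in 8-bit with Hugging Face Transformers and bitsandbytes; no retraining is involved. The sketch below is a hedged example: the model name is just a placeholder, and a CUDA GPU with the bitsandbytes and accelerate packages installed is assumed.

```python
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-1.3b"  # example model; any causal LM on the Hub works similarly

# Post-training quantization: the pre-trained fp16/fp32 weights are converted
# to 8-bit at load time, with no additional training.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",   # places layers on the available GPU(s)/CPU
)

inputs = tokenizer("Quantization lets large models run on", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```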
4. Quantization-Aware Training (QAT)
During training, the model is made aware of the quantization operations that will be applied during inference, and its parameters are adjusted accordingly. This allows the model to learn to handle quantization-induced errors.
- Advantages of QAT
- It tends to preserve model accuracy better than PTQ, since training takes quantization errors into account.
- More robust for precision-sensitive models and performs better at inference, even at lower precisions.
- Cons of QAT
- Requires retraining the model, resulting in longer training times.
- More computationally intensive, as it simulates quantization at every training step.
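The core mechanism behind QAT is "fake quantization": in the forward pass, weights are rounded to the low-precision grid, while gradients flow through as if no rounding had happened (the straight-through estimator). Below is a minimal conceptual sketch in PyTorch, not a full QAT pipeline; frameworks such as torch.ao.quantization automate this with prepare_qat and convert.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize w so the forward pass sees rounding error,
    while gradients pass straight through the rounding op."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax            # symmetric per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward treats the op as identity.
    return w + (w_q - w).detach()

# Toy training step: the layer learns while "seeing" its own quantization error.
layer = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

x, target = torch.randn(8, 16), torch.randn(8, 4)
w_fq = fake_quantize(layer.weight)
out = torch.nn.functional.linear(x, w_fq, layer.bias)
loss = torch.nn.functional.mse_loss(out, target)
loss.backward()                                      # gradients reach layer.weight via the STE
opt.step()
print("loss:", loss.item())
```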
5. Binary and ternary quantization
These methods quantize weights to two values (binary) or three values (ternary), representing the most extreme form of quantization. Weights are restricted to +1 and -1 for binary quantization, or +1, 0, and -1 for ternary quantization, during or after training. This dramatically reduces the number of possible weight values while still remaining somewhat dynamic.
- Advantages of binary and ternary quantization
- It maximizes model compression and inference speed and has minimal memory requirements.
- Fast inference and quantization calculations enable its utility on underpowered hardware.
- Cons of binary and ternary quantization
- High compression and reduced bit width result in a significant drop in accuracy.
- It is not suitable for all types of tasks or data sets and has difficulty with complex tasks.
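A minimal sketch of the idea, following the scaled-sign scheme popularized by binary and ternary weight networks: each weight becomes its sign times a single per-tensor scale, and the ternary variant adds a zero band for small weights (the 0.7 threshold ratio below is a common heuristic, used here as an assumption).

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """Binary quantization: weights become +1/-1, rescaled by the mean magnitude."""
    alpha = np.abs(w).mean()                         # single fp32 scale per tensor
    return alpha * np.sign(w)

def ternarize(w: np.ndarray, threshold_ratio: float = 0.7) -> np.ndarray:
    """Ternary quantization: small weights snap to 0, the rest to +/- alpha."""
    delta = threshold_ratio * np.abs(w).mean()
    mask = (np.abs(w) > delta).astype(w.dtype)       # 0 where |w| is small, 1 elsewhere
    alpha = np.abs(w[mask == 1]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.random.randn(6).astype(np.float32)
print("original:", np.round(w, 3))
print("binary:  ", np.round(binarize(w), 3))
print("ternary: ", np.round(ternarize(w), 3))
```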
The benefits and challenges of quantization
Quantizing large language models brings multiple operational benefits. Chiefly, it achieves a significant reduction in the memory these models need; the goal is for the post-quantization memory footprint to be noticeably smaller. This greater efficiency allows the models to be deployed on platforms with more modest memory capacities, and the decrease in processing power required to run a quantized model translates directly into higher inference speeds and faster response times that improve the user experience.
On the other hand, quantization can also introduce some loss in model accuracy, since it involves approximating real numbers. The challenge is to quantize the model without significantly affecting its performance. This can be done by measuring model accuracy and completion time before and after quantization to gauge effectiveness, efficiency, and precision.
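One simple way to run that before-and-after comparison for a PyTorch model is sketched below; the toy network, the choice of dynamic quantization, and the latency loop are placeholders you would swap for your own model, quantization method, and validation data.

```python
import os
import time
import torch

def size_mb(model: torch.nn.Module) -> float:
    """Size of the serialized state dict on disk, in MB."""
    torch.save(model.state_dict(), "tmp_model.pt")
    size = os.path.getsize("tmp_model.pt") / 1e6
    os.remove("tmp_model.pt")
    return size

def latency_ms(model: torch.nn.Module, inputs: torch.Tensor, runs: int = 50) -> float:
    """Average wall-clock time per forward pass, in milliseconds."""
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(inputs)
    return (time.perf_counter() - start) / runs * 1000

fp32_model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
)
int8_model = torch.ao.quantization.quantize_dynamic(fp32_model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(64, 1024)
print(f"fp32: {size_mb(fp32_model):.1f} MB, {latency_ms(fp32_model, x):.2f} ms/batch")
print(f"int8: {size_mb(int8_model):.1f} MB, {latency_ms(int8_model, x):.2f} ms/batch")
# Accuracy should be compared the same way, before and after, on your own validation set.
```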
By optimizing the balance between performance and resource consumption, quantization not only expands the accessibility of LLMs but also contributes to more sustainable computing practices.
Original. Republished with permission.
Kevin Vu manages the Exxact Corp. blog and works with many of its talented authors, who write about different aspects of deep learning.