Learn how to significantly improve inference latency on CPUs using quantization techniques for mixed, int8, and int4 precisions.
One of the biggest challenges facing the AI space is the need for computing resources to host large-scale, production-grade LLM-based applications. At scale, LLM applications require redundancy, scalability, and reliability, which have historically only been possible on general computing platforms such as CPUs. Still, the prevailing narrative today is that CPUs cannot handle LLM inference with latencies comparable to high-end GPUs.
An open source tool in the ecosystem that can help address inference latency challenges on CPUs is Intel Extension for PyTorch (IPEX), which provides up-to-date feature optimizations for an additional performance boost on Intel hardware. IPEX offers a variety of easy-to-implement optimizations that use hardware-level instructions. This tutorial will delve into the theory of model compression and the out-of-the-box model compression techniques that IPEX provides. These compression techniques directly impact LLM inference performance on general computing platforms such as 4th and 5th generation Intel CPUs.
After application security, inference latency is one of the most critical parameters of an AI application in production. For LLM-based applications, latency or throughput is often measured in tokens/second. As illustrated in the simplified inference processing sequence below, the language model processes the tokens and then converts them into natural language.
Interpreting inference this way can sometimes lead us astray, because we analyze this component of AI applications in isolation from the traditional production software paradigm. Yes, AI applications have their nuances, but at the end of the day, we are still talking about transactions per unit of time. If we start thinking about inference as a transaction like any other, from an application design point of view, the problem becomes less complex. For example, let's say we have a chat application that has the following requirements:
- An average of 300 user sessions per hour
- An average of 5 transactions (LLM inference requests) per user per session
- An average of 100 tokens generated per transaction
- Each session has an average of 10,000 ms (10 s) of overhead for user authentication, security guardrails, network latency, and pre- and post-processing.
- Users take an average of 30,000 ms (30 s) to respond while actively interacting with the chatbot.
- The target average total session time is 3 minutes or less.
Below, with some simple calculations, we can estimate the required latency of our LLM inference engine.
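Here is a rough sketch of that arithmetic in code (an illustrative estimate only, assuming the 30 s of user think time applies to each of the five transactions; how you allocate that time will change the numbers):

# Back-of-the-envelope latency budget based on the requirements above (illustrative only)
transactions_per_session = 5
tokens_per_transaction = 100
session_overhead_s = 10                                   # auth, guardrails, network, pre/post-processing
user_think_time_s = 30 * transactions_per_session         # 30 s of user response time per transaction
total_session_budget_s = 3 * 60                           # 3 minute session target

inference_budget_s = total_session_budget_s - session_overhead_s - user_think_time_s   # 20 s
latency_per_request_s = inference_budget_s / transactions_per_session                  # 4 s per request
required_throughput = tokens_per_transaction / latency_per_request_s                   # 25 tokens/s
print(f"Required generation throughput: ~{required_throughput:.0f} tokens/second")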
Reaching required latency thresholds in production is challenging, especially if it needs to be done without incurring additional IT infrastructure costs. In the rest of this article, we will explore a way we can significantly improve inference latency through model compression.
Model compression is a broad term that covers a variety of techniques, such as model quantization, distillation, pruning, and more. In essence, the main objective of these techniques is to reduce the computational complexity of neural networks.
The method we'll focus on today is model quantization, which involves reducing the byte precision of the weights and, sometimes, the activations. This reduces both the computational load of matrix operations and the memory cost of moving around larger, higher-precision values. The following figure illustrates the process of quantizing fp32 weights to int8.
It is worth mentioning that the 4x reduction in complexity from quantizing fp32 (full precision) to int8 (quarter precision) does not translate into a 4x latency reduction during inference, because inference latency depends on more factors than the model's weights alone.
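To make the mechanics concrete, here is a minimal, illustrative sketch of symmetric, per-tensor int8 quantization of a single weight matrix (not the exact recipe IPEX applies, which is considerably more sophisticated):

import torch

# Quantize one fp32 weight matrix to int8 and dequantize it back (illustrative only)
w_fp32 = torch.randn(4096, 4096)
scale = w_fp32.abs().max() / 127                        # map the largest magnitude onto the int8 range
w_int8 = torch.clamp(torch.round(w_fp32 / scale), -127, 127).to(torch.int8)
w_dequant = w_int8.float() * scale                      # the values actually used at compute time

print("bytes per element: fp32 =", w_fp32.element_size(), ", int8 =", w_int8.element_size())
print("max abs quantization error:", (w_fp32 - w_dequant).abs().max().item())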
As with many things, there is no one-size-fits-all approach, and in this article, we'll explore three of my favorite techniques for quantizing models using IPEX:
Mixed precision (bf16/fp32)
This technique quantizes some, but not all, of the neural network's weights, resulting in partial compression of the model. It is ideal for smaller models, such as LLMs with fewer than 1B parameters.
The implementation is quite simple: using Hugging Face Transformers, a model can be loaded into memory and optimized using the IPEX LLM-specific optimization function, ipex.llm.optimize(model, dtype=dtype). By setting dtype = torch.bfloat16, we activate mixed-precision inference, which improves inference latency relative to full precision (fp32) and stock PyTorch.
import time
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# PART 1: Model and tokenizer loading using transformers
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")
# PART 2: Use IPEX to optimize the model
#dtype = torch.float # use for full precision FP32
dtype = torch.bfloat16 # use for mixed precision inference
model = ipex.llm.optimize(model, dtype=dtype)
# PART 3: Create a hugging face inference pipeline and generate results
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
st = time.time()
results = pipe("A fisherman at sea...", max_length=250)
end = time.time()
generation_latency = end-st
print('generation latency: ', generation_latency)
print(results[0]['generated_text'])
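Depending on your PyTorch and IPEX versions, you may also see a benefit from running generation under CPU autocast, so that operators not converted by ipex.llm.optimize() also execute in bf16. A minimal, optional sketch of that variant (assuming your build supports bf16 autocast on CPU):

# Optional: run generation under CPU autocast so remaining fp32 ops also use bf16
with torch.inference_mode(), torch.autocast("cpu", dtype=torch.bfloat16):
    results = pipe("A fisherman at sea...", max_length=250)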
Of the three compression techniques we'll explore, this is the easiest to implement (measured in lines of code) but offers the smallest net latency improvement over an unquantized baseline.
SmoothQuant (int8)
This technique addresses the primary challenges of LLM quantization, which include handling large-magnitude outliers in activation channels across layers and tokens, a common problem that traditional quantization techniques struggle to handle effectively. SmoothQuant applies a joint mathematical transformation to both the weights and activations within the model. The transformation strategically reduces the disparity between outlier and non-outlier activation values, at the cost of increasing this ratio for the weights. This adjustment makes Transformer layers "quantization-friendly", allowing int8 quantization to be applied successfully without degrading model quality.
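To build intuition for that joint transformation, here is a toy sketch of the core SmoothQuant idea on a single linear layer, with a made-up alpha and simplified per-channel statistics (this is not the calibration procedure IPEX runs, just the math it relies on):

import torch

# Toy SmoothQuant-style smoothing of a single linear layer Y = X @ W (illustrative only)
torch.manual_seed(0)
X = torch.randn(8, 16) * torch.linspace(0.1, 20.0, 16)  # activations with a few outlier channels
W = torch.randn(16, 32)

alpha = 0.5
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)  # per-input-channel factors

X_smooth = X / s                 # activation outliers are flattened...
W_smooth = W * s.unsqueeze(1)    # ...and migrated into the weights

print("output unchanged:", torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3))
print("activation range before:", X.abs().max().item(), "after:", X_smooth.abs().max().item())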
Below is a simple implementation of SmoothQuant, omitting the code that creates the DataLoader, which is a common and well-documented PyTorch pattern. SmoothQuant is an accuracy-aware post-training quantization recipe, meaning that by providing a calibration dataset and a calibration model you can establish a baseline and limit language modeling degradation. The calibration model generates a quantization configuration summary, which is then passed to ipex.llm.optimize() along with the SmoothQuant mapping. Once executed, SmoothQuant is applied and the model can be tested using the .generate() method.
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM

# PART 1: Load model and tokenizer from Hugging Face + load SmoothQuant config mapping
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()
# PART 2: Configure calibration
# prepare your calibration dataset samples
calib_dataset = DataLoader({Your dataloader parameters})
example_inputs = # provide a sample input from your calib_dataset
calibration_model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
)
prepared_model = prepare(
    calibration_model.eval(), qconfig, example_inputs=example_inputs
)
with torch.no_grad():
    for calib_samples in calib_dataset:  # iterate over the calibration batches
        prepared_model(calib_samples)
qconfig_summary_file_path = "./qconfig_summary.json"  # example path for the calibration summary
prepared_model.save_qconf_summary(qconf_summary=qconfig_summary_file_path)
# PART 3: Model Quantization using SmoothQuant
model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
    qconfig_summary_file=qconfig_summary_file_path,
)
# generation inference loop
with torch.inference_mode():
    model.generate({your generate parameters})
SmoothQuant is a powerful model compression technique and helps significantly improve inference latency compared to full-precision models. Still, some initial work is required to prepare a data set and a calibration model.
Weight-only quantization (int8 and int4)
Compared to traditional int8 quantization applied to both activations and weights, weight-only quantization (WOQ) offers a better balance between performance and accuracy. It is worth noting that int4 WOQ requires dequantizing to bf16/fp16 before computation (Figure 4), which introduces overhead. A basic WOQ technique, asymmetric per-tensor round-to-nearest (RTN) quantization, presents challenges and often leads to reduced accuracy (source). However, the literature (Zhewei Yao, 2022) suggests that group-wise quantization of the model weights helps maintain accuracy. Since the weights are only dequantized for the computation, a significant memory advantage remains despite this additional step.
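As a toy illustration of why grouping helps, the sketch below quantizes a weight matrix in groups of 128 values, each with its own scale, and then dequantizes it as WOQ does before the matmul (IPEX's real int4 packing and kernels differ; this only shows the group-wise idea):

import torch

# Group-wise symmetric int4-style weight quantization, one scale per group of 128 (illustrative only)
group_size = 128
W = torch.randn(4096, 4096)

W_groups = W.reshape(-1, group_size)
scales = W_groups.abs().amax(dim=1, keepdim=True) / 7        # symmetric int4 range is roughly [-8, 7]
W_int4 = torch.clamp(torch.round(W_groups / scales), -8, 7)
W_dequant = (W_int4 * scales).reshape_as(W)                  # dequantized (e.g., to bf16/fp16) before compute

print("mean abs error:", (W - W_dequant).abs().mean().item())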
The WOQ implementation below shows the few lines of code needed to quantize a Hugging Face model with this technique. As with the previous implementations, we start by loading a model and tokenizer from Hugging Face. We can use the get_weight_only_quant_qconfig_mapping() method to set up the WOQ recipe, which is then passed to ipex.llm.optimize() together with the model for optimization and quantization. The quantized model can then be used for inference with the .generate() method.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM

# PART 1: Model and tokenizer loading
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")
# PART 2: Preparation of quantization config
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.qint8,  # or torch.quint4x2
    lowp_mode=ipex.quantization.WoqLowpMode.NONE,  # or FP16, BF16, INT8
)
checkpoint = None # optionally load int4 or int8 checkpoint
# PART 3: Model optimization and quantization
model = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_checkpoint=checkpoint)
# PART 4: Generation inference loop
with torch.inference_mode():
    model.generate({your generate parameters})
As you can see, WOQ provides a powerful way to compress models to a fraction of their original size with limited impact on the model's language modeling capabilities.
As an engineer at Intel, I have worked closely with Intel's IPEX engineering team. This has given me a unique insight into its benefits and development roadmap, making IPEX my preferred tool. However, for developers looking for simplicity without the need to manage an additional dependency, PyTorch offers three quantization recipes: Eager Mode, FX Graph Mode (under maintenance), and PyTorch 2 Export Quantization, which provide robust, less specialized alternatives.
Regardless of which technique you choose, model compression will result in some degree of language modeling performance loss, although often less than 1%. For this reason, it is essential to evaluate the fault tolerance of the application and establish a baseline for model performance at full precision (FP32) and/or half precision (BF16/FP16) before proceeding with quantization.
In applications that leverage some degree of in-context learning, such as retrieval-augmented generation (RAG), model compression can be an excellent option. In these cases, mission-critical knowledge is fed into the model at inference time, so the risk is greatly reduced even for applications with low fault tolerance.
Quantization is a great way to address LLM inference latency issues without upgrading or expanding computing infrastructure. It's worth exploring regardless of your use case, and IPEX provides a good option to get started with just a few lines of code.
Some interesting things to try would be:
- Try the sample code in this tutorial in the free Jupyter environment on the Intel Developer Cloud.
- Take an existing model that is running on an accelerator with full precision and test it on a CPU at int4/int8
- Explore all three techniques and determine which works best for your use case. Be sure to compare language modeling performance loss, not just latency.
- Upload your quantized model to the Hugging Face Model Hub! If you do, please let me know. I'd love to check it out!
Thank you for reading! Don't forget to follow my profile for more articles like this!