To ground our investigation into quantization, it is important to reflect on exactly what we mean by “quantizing” numbers. So far we’ve discussed that through quantization, we take a set of high-precision values and map them to a lower precision in such a way that best preserves their relationships, but we have not zoomed into the mechanics of this operation. Unsurprisingly, we find there are nuances and design choices to be made concerning how we remap values into the quantized space, which vary depending on use case. In this section, we will seek to understand the knobs and levers which guide the quantization process, so that we can better understand the research and equip ourselves to bring educated decision making into our deployments.
Bit Width
Throughout our discussion on quantization, we will refer to the bit width of the quantized values, which is the number of bits available to express each value. A bit can only store a binary value of 0 or 1, but sets of bits can have their combinations interpreted as incremental integers. For instance, having 2 bits allows for 4 total combinations ({0, 0}, {0, 1}, {1, 0}, {1, 1}) which can represent integers in the range (0, 3). With N bits, we get 2 to the power of N possible combinations, so an 8-bit integer can represent 256 numbers. While unsigned integers count from zero up to the maximum value, signed integers shift the range so that it is roughly centered on zero (in practice via two's complement encoding, with the leading bit indicating the sign). Therefore, an unsigned 8-bit integer has a range of (0, 255), while a signed 8-bit integer spans (-128, 127).
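These ranges follow directly from the bit width. As a quick illustration, here is a minimal Python sketch (the helper name is ours, chosen for clarity):

```python
def integer_range(num_bits, signed=False):
    """Return the (low, high) range representable with the given bit width."""
    if signed:
        return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1

print(integer_range(8))               # (0, 255)   -> 256 distinct values
print(integer_range(8, signed=True))  # (-128, 127)
print(integer_range(4))               # (0, 15)    -> 16 quantization levels at 4 bits
```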
This fundamental knowledge of how bits represent information will help us to contextualize the numeric spaces that floating-point values get mapped to in the techniques we study: when we hear that a network layer is quantized to 4 bits, we understand that the destination space has 2 to the power of 4 (16) discrete values. In quantization, these values do not necessarily represent the quantized weights themselves; they often refer to the indices of the quantization levels, the “buckets” into which the values of the input distribution are mapped. Each index corresponds to a codeword that represents a specific quantized value within the predefined numeric space. Together, these codewords form a codebook, and the values obtained from the codebook can be either floating-point or integer values, depending on the type of arithmetic to be performed. The thresholds that define the buckets depend on the chosen quantization function, as we will see. Note that codeword and codebook are general terms, and that in most cases the codeword will be the same as the value returned from the codebook.
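To make the distinction between indices, codewords, and codebook values concrete, here is a tiny hypothetical 2-bit example (the codebook values are made up purely for illustration):

```python
import numpy as np

# A 2-bit quantization has 4 levels. The codebook maps each index (codeword)
# to the value it stands for, which may be floating point.
codebook = np.array([-0.75, -0.25, 0.25, 0.75])

codes = np.array([0, 3, 1, 2, 3])   # what is actually stored: 2-bit indices
decoded = codebook[codes]           # [-0.75, 0.75, -0.25, 0.25, 0.75]
```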
Floating-Point, Fixed-Point, and Integer-Only Quantization
Now that we understand bit widths, we should take a moment to touch on the distinctions between floating-point, fixed-point, and integer-only quantization, so that we are clear on their meanings. While representing integers with binary bits is straightforward, operating on numbers with fractional components is a bit more complex. Both floating-point and fixed-point data types have been designed to do this, and selecting between them depends both on the deployment hardware and on the desired accuracy-efficiency tradeoff, as not all hardware supports floating-point operations, and fixed-point arithmetic can offer more power efficiency at the cost of reduced numeric range and precision.
Floating-point numbers allocate their bits to represent three pieces of information: the sign, the exponent, and the mantissa, which together encode the represented value. The number of bits in the exponent defines the magnitude of the numeric range, and the number of mantissa bits defines the level of precision. As one example, the IEEE 754 standard for a 32-bit floating-point number (FP32) gives the first bit to the sign, 8 bits to the exponent, and the remaining 23 bits to the mantissa. Floating-point values are “floating” because they store an exponent for each individual number, allowing the position of the radix point to “float,” akin to how scientific notation moves the decimal in base 10, except that computers operate in base 2 (binary). This flexibility enables precise representation of a wide range of values, with especially fine resolution near zero.
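We can inspect this layout directly. The short sketch below unpacks the sign, exponent, and mantissa bits of an FP32 value using Python's standard struct module; the helper name is ours:

```python
import struct

def fp32_fields(x):
    """Split an IEEE 754 single-precision float into its sign, exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (stored with a bias of 127)
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    return sign, exponent, mantissa

print(fp32_fields(-6.25))   # (1, 129, 4718592): i.e. -1.5625 * 2^(129 - 127)
```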
In contrast, “fixed” point precision does not use a dynamic scaling factor, and instead allocates its bits into sign, integer, and fractional (often still referred to as mantissa) components. While this means higher efficiency and power-saving operations, the dynamic range and precision both suffer. To understand this, imagine that you want to represent a number which is as close to zero as possible. To do so, you would carry the decimal place out as far as you could. Floating-point values are free to use increasingly negative exponents to push the decimal further to the left and provide extra resolution in this situation, but a fixed-point value is stuck with the precision offered by its fixed number of fractional bits.
Integers can be considered an extreme case of fixed-point in which no bits are given to the fractional component. In fact, fixed-point bits can be operated on directly as if they were integers, and the result can be rescaled in software to recover the correct fixed-point value. Because the calculations are ultimately equivalent but integer arithmetic is cheaper and more power-efficient on hardware, neural network quantization research favors integer-only quantization, converting the original floating-point values into integers rather than into fixed-point representations. This is particularly important for deployment on battery-powered devices, which also often contain hardware that only supports integer arithmetic.
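To illustrate why integer hardware suffices, here is a toy fixed-point multiply in a hypothetical Q4.4 format (4 integer bits, 4 fractional bits), carried out entirely with integer operations and a final rescale:

```python
FRAC_BITS = 4                      # Q4.4: 4 integer bits, 4 fractional bits
SCALE = 1 << FRAC_BITS             # 16

def to_fixed(x):
    return int(round(x * SCALE))   # e.g. 2.75 -> 44

def fixed_mul(a, b):
    # Plain integer multiply (the intermediate product is wider, as on real
    # hardware accumulators), then shift right to restore the fixed-point scale.
    return (a * b) >> FRAC_BITS

a, b = to_fixed(2.75), to_fixed(1.5)
product = fixed_mul(a, b)          # 66, which encodes 66 / 16 = 4.125
print(product / SCALE)             # 4.125 (exact here; in general, rounding error applies)
```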
Uniform Quantization
To quantize a set of numbers, we must first define a quantization function Q(r), where r is the real number (weight or activation) to be quantized. The most common quantization function is shown below:
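$$
Q(r) = \mathrm{Int}\!\left(\frac{r}{S}\right) - Z
$$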
In this formula, Z represents an integer zero-point, and S is the scaling factor. In symmetric quantization, Z is simply set to zero and drops out of the equation, while in asymmetric quantization, Z offsets the zero point, allowing more of the quantization range to be devoted to either the positive or the negative side of the input distribution. This asymmetry can be extremely useful in certain cases, for example when quantizing post-ReLU activation signals, which contain only positive numbers. The Int(·) function maps the scaled continuous value to an integer, typically through rounding, but in some cases following more complex procedures, as we will encounter later.
Choosing the correct scaling factor (S) is non-trivial, and requires careful consideration of the distribution of values to be quantized. Because the quantized output space has only a finite number of values (or quantization levels) to map the inputs to, a clipping range (α, β) must be established that provides a good fit for the incoming value distribution. The chosen clipping range must strike a balance between clamping extreme input values too aggressively and wasting quantization levels on sparsely populated tails. For now, we consider uniform quantization, where the bucketing thresholds, or quantization steps, are evenly spaced. The calculation of the scaling factor is as follows:
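$$
S = \frac{\beta - \alpha}{2^{b} - 1}
$$

where b is the bit width of the quantized values.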
The shapes of trained parameter distributions can vary widely between networks and are influenced by a number of factors. The activation signals generated by those weights are even more dynamic and unpredictable, making any assumptions about the correct clipping ranges difficult. This is why we must calibrate the clipping range based on our model and data. For best accuracy, practitioners may choose to calibrate the clipping range for activations online during inference, known as dynamic quantization. As one might expect, this comes with extra computational overhead, and it is therefore far less common than static quantization, in which the clipping range is calibrated ahead of time and fixed during inference.
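To make this concrete, here is a minimal sketch (assuming NumPy, an unsigned integer target range, and illustrative function names rather than any particular library's API) that statically calibrates the clipping range from observed values and applies the uniform quantization function defined above:

```python
import numpy as np

def calibrate(x, num_bits=8):
    """Static calibration: pick the clipping range (alpha, beta) from observed
    values and derive the scaling factor S and integer zero-point Z."""
    alpha, beta = float(x.min()), float(x.max())   # clipping range; assumes beta > alpha
    qmin, qmax = 0, 2 ** num_bits - 1              # unsigned b-bit target range
    S = (beta - alpha) / (qmax - qmin)             # S = (beta - alpha) / (2^b - 1)
    Z = int(round(alpha / S)) - qmin               # offset so that alpha maps to qmin
    return S, Z, qmin, qmax

def quantize(r, S, Z, qmin, qmax):
    """Q(r) = Int(r / S) - Z, clipped to the representable integer range."""
    q = np.round(r / S) - Z
    return np.clip(q, qmin, qmax).astype(np.int32)

# Usage: calibrate once on representative data, then quantize at inference time.
activations = np.random.randn(1024).astype(np.float32)
S, Z, qmin, qmax = calibrate(activations)
q = quantize(activations, S, Z, qmin, qmax)
```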
Dequantization
Here we establish the reverse uniform quantization operation which decodes the quantized values back into the original numeric space, albeit imperfectly, since the rounding operation is non-reversible. We can decode our approximate values using the following formula:
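$$
\tilde{r} = S \cdot \left( Q(r) + Z \right)
$$

In code, and reusing the S, Z, and quantized values q from the calibration sketch above, this is a single multiply and add (again, a minimal sketch):

```python
import numpy as np

def dequantize(q, S, Z):
    """Approximately reconstruct the original values: r~ = S * (Q(r) + Z)."""
    return S * (q + Z)

# With the S, Z, and q produced by the earlier calibration sketch,
# dequantize(q, S, Z) returns values close to the original activations,
# differing only by the rounding (and any clipping) error.
```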
Non-Uniform Quantization
The astute reader will probably have noticed that imposing uniformly spaced bucketing thresholds on an input distribution of any shape other than uniform will leave some quantization levels far more heavily used than others, and that adjusting the bucket widths to concentrate more levels in the denser regions of the distribution would more faithfully capture the nuances of the input signal. This concept has been investigated in the study of non-uniform quantization, and has indeed shown benefits in signal fidelity; however, the hardware-optimized calculations made possible by uniform quantization have made it the de facto neural network quantization method. The equation below describes the non-uniform quantization process:
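$$
Q(r) = X_i, \quad \text{if } r \in [\Delta_i, \Delta_{i+1})
$$

Here the X_i are the quantization levels (the values stored in the codebook) and the Δ_i are the bucketing thresholds; unlike the uniform case, neither the levels nor the steps between thresholds need to be evenly spaced.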
Many works in non-uniform quantization refer to learning centroids, which represent the centers of clusters in the input distribution to which the surrounding values are mapped through the quantization process. To think of this another way, in uniform quantization, where the thresholds are evenly spaced on the input distribution, the centroids are simply the values directly in between the bucketing thresholds.
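As one illustration of what learning centroids can look like in practice, the sketch below runs a simple 1-D k-means over a weight distribution; k-means is a common choice here but by no means the only one, and the helper names are ours:

```python
import numpy as np

def learn_centroids(values, k=16, iters=25):
    """Learn k centroids (non-uniform quantization levels) from a 1-D distribution."""
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))  # spread initial levels over the data
    for _ in range(iters):
        # Assignment step: each value's codeword is the index of its nearest centroid.
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of the values it captured.
        for i in range(k):
            if np.any(codes == i):
                centroids[i] = values[codes == i].mean()
    return centroids, codes

weights = np.random.randn(4096)
centroids, codes = learn_centroids(weights)   # codebook and codewords
weights_hat = centroids[codes]                # dequantization is a table lookup
```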
Mixed-Precision Quantization
As we saw with pruning, a trained neural network’s performance is more sensitive to changes in some layers and submodules than in others, and by measuring these sensitivities, entire pieces of the network can be removed without significantly affecting error. Intuitively, the same holds for varying levels of quantization, with some network components tolerating much lower bit widths than their counterparts. We have already mentioned the most fundamental example of this: the use of 16-bit floats in less sensitive network operations to substantially reduce the memory footprint during training. More generally, mixed-precision quantization can refer to any combination of different quantization levels throughout a network.
Related to the concept of mixed-precision quantization is the granularity of quantization, which may be layer-wise, group-wise, channel-wise, or sub-channel-wise, and describes the scale at which distinct sets of quantization parameters are calibrated. Intuitively, computational overhead increases with granularity, representing an accuracy/efficiency trade-off. For example, in convolutional neural networks (CNNs), channel-wise granularity (one set of quantization parameters per convolutional filter) is often the weapon of choice, since going finer, to sub-channel-wise quantization, would add too much complexity.
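As a small example of channel-wise granularity, the following sketch (assuming a convolutional weight tensor laid out as out_channels × in_channels × kH × kW and symmetric quantization) computes one scaling factor per output channel rather than a single scale for the whole layer:

```python
import numpy as np

def per_channel_scales(weight, num_bits=8):
    """One symmetric scaling factor per output channel of a conv weight tensor."""
    qmax = 2 ** (num_bits - 1) - 1                       # e.g. 127 for signed 8-bit
    flat = np.abs(weight.reshape(weight.shape[0], -1))   # flatten everything but the channel axis
    return flat.max(axis=1) / qmax                       # shape: (out_channels,)

W = np.random.randn(64, 32, 3, 3)
scales = per_channel_scales(W)   # 64 scales instead of a single layer-wise scale
W_q = np.round(W / scales[:, None, None, None]).clip(-128, 127)
```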
Scalar vs. Vector Quantization
While the majority of research in quantization has historically focused on quantizing individual values within the matrices, it is possible to learn multidimensional centroids as well. This means that matrices can be split into vectors, and each of those vectors can be assigned a codeword that points to its closest centroid, creating the possibility of recovering entire pieces of the matrix from single codebook lookups, effectively storing a set of numbers as a single value and greatly increasing compression levels. This is known as Vector Quantization, and the advantages it offers have been attracting increasing interest. “Vector Quantization” typically refers to splitting the matrices into column vectors, but these vectors can be further split into sub-vectors in a practice known as Product Quantization, which generalizes both vector and scalar quantization at its extremes. The idea is that the assembly of centroid vectors returned from the codebook, using the relatively small structure of stored codewords, will faithfully recreate the original, larger matrix. We will see that this has indeed proven to be a very powerful model compression technique.
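The sketch below illustrates the idea of product quantization under some simplifying assumptions: the number of rows divides evenly by the sub-vector length, the codebook is learned with a few plain k-means iterations, and all names are illustrative rather than drawn from any particular library:

```python
import numpy as np

def product_quantize(W, subvec_len=4, k=256, iters=20, seed=0):
    """Split each column of W into sub-vectors, learn a shared codebook of centroid
    sub-vectors, and store only the codeword (index) for each sub-vector."""
    d, n = W.shape                                   # assumes d % subvec_len == 0
    subvecs = (W.reshape(d // subvec_len, subvec_len, n)
                .transpose(0, 2, 1)
                .reshape(-1, subvec_len))
    rng = np.random.default_rng(seed)
    codebook = subvecs[rng.choice(len(subvecs), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every sub-vector to its nearest centroid sub-vector.
        codes = ((subvecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
        # Move each centroid to the mean of its assigned sub-vectors.
        for i in range(k):
            if np.any(codes == i):
                codebook[i] = subvecs[codes == i].mean(axis=0)
    # Reconstruction: a codebook lookup per sub-vector, reassembled into the matrix shape.
    W_hat = (codebook[codes]
             .reshape(d // subvec_len, n, subvec_len)
             .transpose(0, 2, 1)
             .reshape(d, n))
    return codes, codebook, W_hat

W = np.random.randn(128, 128).astype(np.float32)
codes, codebook, W_hat = product_quantize(W)   # with k=256, each code fits in 8 bits per 4 weights
```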
Compensating for the Effects of Quantization
It makes sense that we cannot simply round all of the weights in a neural network to various resolutions and expect things to still work properly, so we must come up with a plan for how to compensate for the perturbations caused by the quantization process. As we learned above, it is possible to train or fine-tune models under simulated quantization in order to drastically increase the amount of quantization that can be performed without affecting performance, a technique called Quantization-Aware Training (QAT), which also allows the quantization parameters themselves to be learned during training. However, performing QAT requires having the hardware and data necessary to train the model, which is often not possible, particularly for very large models like today’s LLMs. To address this issue, Post-Training Quantization (PTQ) techniques avoid training altogether and require only a small amount of unlabeled data to calibrate the quantization function, while Zero-Shot Quantization (ZSQ) explores the ideal “data-free” scenario, which requires no calibration data at all.
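For intuition, the simulated quantization used in QAT typically amounts to a quantize-then-dequantize (“fake quantization”) operation inserted into the forward pass, with gradients passed through the rounding step via a straight-through estimator; a minimal, forward-only sketch in the notation used earlier:

```python
import numpy as np

def fake_quantize(x, S, Z, qmin, qmax):
    """Quantize then immediately dequantize, so training sees the quantization
    error while all computation stays in floating point."""
    q = np.clip(np.round(x / S) - Z, qmin, qmax)
    return S * (q + Z)

# Usage: with S, Z, qmin, qmax from a calibration step (as in the earlier sketch),
# replacing x with fake_quantize(x, S, Z, qmin, qmax) in the forward pass exposes
# the network to quantization noise during training.
```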
We will see each of these techniques highlighted in more detail as we journey through the literature, so let us now board our temporal tour bus and travel back to the end of the last century, when researchers were being similarly tantalized by the power of neural networks that exceeded their hardware limitations, and first started to consider how we might hope to deploy these complex models on mobile hardware.