Introduction
The goal of quantizing a model is straightforward: deploying a large model in a resource-constrained environment while maintaining approximately the same performance. To this end, quantization infrastructures are designed. For example, the commonly used data types include BF16, FP16, FP8, and NVFP4 for floating-point numbers.
They differ in the number of bits used to represent the sign, the exponent, and the mantissa (or significand).
- BF16: 1 bit for sign, 8 bits for exponent, and 7 bits for mantissa. Shares approximately the same value range as FP32 by sacrificing precision. Number range: -3.39e38 to 3.39e38.
- FP16: 1 bit for sign, 5 bits for exponent, and 10 bits for mantissa. More precise than BF16 by sacrificing representation range. Number range: -65504 to 65504.
- FP8: two variants are commonly used, E4M3 and E5M2.
- NVFP4: only 1 bit for mantissa and 2 bits for exponent (plus 1 sign bit); additional scaling factors are applied to different blocks of a tensor to extend the representation range.
In total, the represented value can be calculated via (normalized):

$$v = (-1)^{s} \times 2^{e - \text{bias}} \times (1.m_1 m_2 \ldots m_t)_2, \qquad \text{bias} = 2^{k-1} - 1$$

where $s$ is the sign bit, $e$ is the unsigned exponent field of $k$ bits, and $m_1 \ldots m_t$ are the $t$ mantissa bits.

- Normalized/standard number range: the mantissa adds an implicit leading $1$, forming a value like $(1.m_1 m_2 \ldots m_t)_2$. The minimum exponent field for a normalized number is $e = 1$. For example, for BF16 ($k = 8$, $t = 7$, bias $= 127$), the maximum value is $(2 - 2^{-7}) \times 2^{127} \approx 3.39 \times 10^{38}$, and the minimum positive normalized value is $2^{-126} \approx 1.18 \times 10^{-38}$.
- Denormalized number range: when the exponent field is $e = 0$, the mantissa adds an implicit leading $0$ while the effective exponent stays at $1 - \text{bias}$. Taking BF16 as an example, the values around zero extend down to $2^{-7} \times 2^{-126} = 2^{-133}$, forcing values to approach zero gradually.
- Zero: when both $e = 0$ and $m = 0$, the number is exactly $0$.
- NaN/Inf: an all-ones exponent field ($e = 255$ in BF16) is a special value reserved for NaN/Inf.
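These extremes can be verified numerically. Below is a minimal sketch (assuming PyTorch is available) that computes the BF16 limits from the formula above and compares them against `torch.finfo`:

```python
import torch

# BF16: 1 sign bit, k = 8 exponent bits, t = 7 mantissa bits
k, t = 8, 7
bias = 2 ** (k - 1) - 1  # 127

# Largest normalized value: e = 2^k - 2 (all-ones exponent is NaN/Inf),
# all-ones mantissa -> (2 - 2^-t) * 2^(e_max - bias)
max_normal = (2 - 2 ** -t) * 2.0 ** (2 ** k - 2 - bias)

# Smallest positive normalized value: e = 1, mantissa zero -> 1.0 * 2^(1 - bias)
min_normal = 2.0 ** (1 - bias)

# Smallest positive denormalized value: e = 0, only the last mantissa bit set
min_denormal = 2.0 ** -t * 2.0 ** (1 - bias)

print(max_normal)                        # ~3.39e38
print(min_normal)                        # ~1.18e-38
print(min_denormal)                      # 2^-133 ~ 9.18e-41
print(torch.finfo(torch.bfloat16).max)   # matches max_normal
print(torch.finfo(torch.bfloat16).tiny)  # matches min_normal
```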
Quantization is typically applied in three scenarios: model weight quantization, activation quantization, and quantization of the KV cache for decoder-only transformers.
Quantization Algorithms
Quantization maps a float number $x$ to a low-precision value $x_q$.

Affine/Asymmetric Quantization: Defined by a scale factor $s$ and a zero-point $z$:

$$x_q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right)$$

where:

- round converts the scaled value to the nearest quantized representation.
- clip ensures the scaled value stays within the range $[q_{\min}, q_{\max}]$ of the quantized representation.

To recover the full-precision value from a quantized one:

$$\hat{x} = s \cdot (x_q - z)$$

As we can see, round and clip naturally introduce errors, which are inherent to the quantization process.
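As a concrete illustration, here is a minimal NumPy sketch of affine quantization to signed int8 (the function names are my own, not from any particular library):

```python
import numpy as np

def affine_quantize(x, s, z, q_min=-128, q_max=127):
    """Map float values x to int8 using scale s and zero-point z."""
    x_q = np.clip(np.round(x / s) + z, q_min, q_max)
    return x_q.astype(np.int8)

def affine_dequantize(x_q, s, z):
    """Recover an approximation of the original float values."""
    return s * (x_q.astype(np.float32) - z)

x = np.array([-0.5, 0.0, 0.3, 1.2], dtype=np.float32)
s, z = 1.7 / 255, -53        # example parameters for the value range [-0.5, 1.2]
x_q = affine_quantize(x, s, z)
x_hat = affine_dequantize(x_q, s, z)
print(x_q, x_hat)            # round/clip errors: x_hat only approximates x
```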
Symmetric Quantization: Fixing the zero-point to $z = 0$ simplifies the mapping to $x_q = \mathrm{clip}(\mathrm{round}(x / s),\; q_{\min},\; q_{\max})$ and the recovery to $\hat{x} = s \cdot x_q$.
AbsMax algorithm: A widely used algorithm to calculate the scale factor, thanks to its simplicity and effectiveness. For symmetric quantization, it can be written as:

$$s = \frac{\max(|x|)}{q_{\max}}$$

For affine quantization:

$$s = \frac{\max(x) - \min(x)}{q_{\max} - q_{\min}}, \qquad z = \mathrm{round}\!\left(q_{\min} - \frac{\min(x)}{s}\right)$$
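Both variants are easy to implement. A minimal sketch (helper names are my own, int8 assumed as the target type):

```python
import numpy as np

def absmax_scale_symmetric(x, q_max=127):
    # s = max(|x|) / q_max, zero-point fixed at 0
    return np.max(np.abs(x)) / q_max

def absmax_scale_affine(x, q_min=-128, q_max=127):
    # s stretches [min(x), max(x)] over the integer range; z shifts min(x) onto q_min
    s = (np.max(x) - np.min(x)) / (q_max - q_min)
    z = int(np.round(q_min - np.min(x) / s))
    return s, z

x = np.array([-0.5, 0.0, 0.3, 1.2], dtype=np.float32)
print(absmax_scale_symmetric(x))  # 1.2 / 127 ~= 0.00945
print(absmax_scale_affine(x))     # (~0.00667, -53), as in the example above
```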
Other advanced algorithms: The main goal of quantization algorithms is to find a balance between computational overhead and model performance, and some advanced algorithms are designed to reduce the errors introduced by the quantization process.
- Activation-aware Weight Quantization (AWQ)
- Generative Pre-trained Transformer Quantization (GPTQ)
- SmoothQuant (sketched below)
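To give a flavor of these methods, here is a rough sketch of SmoothQuant's core idea: activation outliers are migrated into the weights with a per-channel smoothing factor $s_j = \max(|X_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}$, so that $Y = (X \operatorname{diag}(s)^{-1})(\operatorname{diag}(s)\, W)$ is mathematically unchanged while both factors become easier to quantize. This is only an illustrative sketch, not the paper's full implementation:

```python
import numpy as np

def smooth(X, W, alpha=0.5, eps=1e-5):
    """SmoothQuant-style smoothing (sketch): X @ W == X_s @ W_s exactly,
    but X_s has smaller per-channel outliers and is easier to quantize."""
    act_max = np.max(np.abs(X), axis=0)            # per input channel of X
    w_max = np.max(np.abs(W), axis=1)              # per input channel of W
    s = (act_max ** alpha) / (np.maximum(w_max, eps) ** (1 - alpha))
    s = np.maximum(s, eps)                         # guard against zero channels
    return X / s, W * s[:, None]                   # X_s, W_s

X = np.random.randn(4, 8).astype(np.float32)
X[:, 0] *= 50.0                                    # inject an activation outlier channel
W = np.random.randn(8, 16).astype(np.float32)
X_s, W_s = smooth(X, W)
print(np.allclose(X @ W, X_s @ W_s, atol=1e-3))    # True: the output is preserved
```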
Quantization Granularity
To define which elements of a tensor share the same quantization parameters, the quantization process can be conducted at different granularities. That is, the granularity determines which elements of a tensor are used to compute the quantization parameters, as illustrated in the sketch after this list.
- Per-Tensor/Per-Layer: All values of a tensor share the same quantization parameters. This is the simplest and most memory-efficient approach, but it leads to higher quantization errors, especially when the data distribution varies across dimensions.
- Per-Channel: A separate set of quantization parameters is shared within each channel. This method reduces quantization errors by confining outlier values to their own dimension, rather than letting them affect the entire tensor.
- Per-Block/Per-Group: This method divides a tensor into smaller blocks, each with its own set of quantization parameters, which is useful when different regions of the tensor have varying distributions.
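The difference is easiest to see in code. A minimal sketch computing symmetric AbsMax scales at the three granularities for a 2-D weight tensor (the block size of 4 is arbitrary):

```python
import numpy as np

W = np.random.randn(64, 128).astype(np.float32)
q_max = 127  # int8 target assumed

# Per-tensor: a single scale for all values
s_tensor = np.max(np.abs(W)) / q_max                        # scalar

# Per-channel: one scale per output channel (row)
s_channel = np.max(np.abs(W), axis=1) / q_max               # shape (64,)

# Per-block/group: one scale per group of 4 consecutive values within a row
block = 4
s_block = np.max(np.abs(W.reshape(64, 128 // block, block)),
                 axis=-1) / q_max                           # shape (64, 32)
```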
Quantization Approaches
In most cases, quantizing the pre-trained model weights is straightforward, as they are static values independent of the input data. However, quantizing the activation values is more troublesome due to their dependence on the input and on important outliers, which significantly affect the quantization scaling factor.
Post-Training Quantization (PTQ): During PTQ, we attach observers to the activations we want to quantize and run inference on a representative calibration dataset. The observers record the activation values, and we apply the same quantization algorithms to determine a scaling factor (a sketch follows the list below).
- Weight-only quantization: Given a pre-trained model, the weights are static and data-independent. As the weights are fixed, no additional data is required; we simply calculate the scaling factor (with or without a zero-point) and map the weights to low-precision values.
- Weight and Activation quantization: We use representative data for both weight and activation quantization, as the activation values are computed at runtime. The process of collecting statistics of the activation values and determining a suitable scale factor and zero-point is called calibration. During calibration, the model weights are frozen, and the representative input data is used to calculate the quantization parameters.
- Static quantization: Feed all calibration data through the model and calculate the scale factor and zero-point once; these parameters remain fixed after the model is deployed.
- Dynamic quantization: No calibration data is required; the scale factor and zero-point are calculated at runtime and vary for each input.
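A minimal sketch of static PTQ calibration with a min/max observer (the class and method names are hypothetical, not from any specific framework):

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of activations seen during calibration."""
    def __init__(self):
        self.min, self.max = np.inf, -np.inf

    def observe(self, x):
        self.min = min(self.min, float(np.min(x)))
        self.max = max(self.max, float(np.max(x)))

    def quant_params(self, q_min=-128, q_max=127):
        # Affine parameters derived once after calibration, then frozen.
        s = (self.max - self.min) / (q_max - q_min)
        z = int(round(q_min - self.min / s))
        return s, z

obs = MinMaxObserver()
for batch in [np.random.randn(8, 16) for _ in range(10)]:  # calibration data
    obs.observe(batch)       # in practice, hooked onto the activation of interest
s, z = obs.quant_params()    # static: reused unchanged at inference time
```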
Quantization-Aware Training (QAT): Designed to offset the accuracy degradation that accompanies quantization, a quantization-aware module is integrated into the model during training to perform fake quantization. Specifically, the fake-quantization module mimics the quantized model's behaviour by applying quantization followed by a de-quantization step, introducing the quantization errors explicitly during training. Because the model parameters are optimized in the presence of quantization errors, the model remains performant even after quantization.
NOTE: A non-differentiability problem arises in QAT because quantization functions such as round are not differentiable. QAT uses the straight-through estimator (STE) to approximate the gradients of these functions as the identity during backpropagation, as in the sketch below.
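A minimal PyTorch sketch of one fake-quantization step with the STE (quantize, de-quantize, and pass gradients straight through):

```python
import torch

def fake_quantize(x, s, z, q_min=-128, q_max=127):
    """Quantize-then-dequantize so the quantization error appears in the
    forward pass, while the backward pass treats the op as identity (STE)."""
    x_q = torch.clamp(torch.round(x / s) + z, q_min, q_max)
    x_dq = s * (x_q - z)
    # STE trick: the forward value is x_dq, but gradients flow as if it were x.
    return x + (x_dq - x).detach()

x = torch.randn(4, requires_grad=True)
y = fake_quantize(x, s=0.1, z=0)
y.sum().backward()
print(x.grad)  # all ones: round/clamp were bypassed in the backward pass
```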