Model Quantization | Primer

References:
- Model Quantization: Concepts, Methods, and Why It Matters

Quantization Algorithms

Quantization maps a floating-point value $x$ to a low-precision value $x_q$ (e.g., an 8-bit integer).

Affine/Asymmetric Quantization: Defined by a scale factor $s$ and a zero-point $z$. The scale factor $s$ defines the step size of the quantizer, and the zero-point $z$ is an integer of the same type as $x_q$ that maps the real value zero exactly during quantization. The quantization process can be written as:

$$x_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right)$$

where:
- $\mathrm{round}$ converts the scaled value to the nearest quantized representation.
- $\mathrm{clip}$ ensures the scaled value stays within the range $[q_{\min}, q_{\max}]$ of the quantized representation.

To recover a full-precision approximation $\hat{x}$ from a quantized value:

$$\hat{x} = s \, (x_q - z)$$

As the equations show, the round and clip operations naturally introduce errors, which are inherent to the quantization process.
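The affine scheme above can be sketched in a few lines of NumPy. The helper names (`affine_qparams`, `quantize`, `dequantize`) are illustrative, not from any particular library; the scale and zero-point are derived from the observed min/max of the tensor, a common (but not the only) calibration choice.

```python
import numpy as np

QMIN, QMAX = -128, 127  # signed int8 range

def affine_qparams(x):
    """Derive scale s and zero-point z from the observed min/max of x."""
    # Include 0.0 in the range so the real value zero maps exactly.
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    s = (x_max - x_min) / (QMAX - QMIN)
    z = int(round(QMIN - x_min / s))  # zero-point is an integer, like x_q
    return s, z

def quantize(x, s, z):
    # x_q = clip(round(x / s) + z, q_min, q_max)
    return np.clip(np.round(x / s) + z, QMIN, QMAX).astype(np.int8)

def dequantize(x_q, s, z):
    # x_hat = s * (x_q - z)
    return s * (x_q.astype(np.float32) - z)

x = np.array([-0.6, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)
s, z = affine_qparams(x)
x_hat = dequantize(quantize(x, s, z), s, z)
```

Note that `x_hat[2]` recovers 0.0 exactly (the zero-point guarantees this), while the other elements carry a rounding error of at most about half a step $s$.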

Symmetric Quantization: Fixing the zero-point at $z = 0$, symmetric quantization reduces computation overhead by eliminating the zero-point addition. Quantization and de-quantization can be written as:

$$x_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right),\; q_{\min},\; q_{\max}\right), \qquad \hat{x} = s \cdot x_q$$
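A symmetric variant of the same sketch follows. With $z = 0$, the scale is typically derived from the maximum absolute value, and the range is restricted to $[-127, 127]$ so it is symmetric around zero; the helper names here are again illustrative.

```python
import numpy as np

QMAX = 127  # use [-127, 127] so the integer range is symmetric around 0

def symmetric_scale(x):
    """Derive scale s from the maximum absolute value of x (z is fixed at 0)."""
    return np.max(np.abs(x)) / QMAX

def quantize_sym(x, s):
    # x_q = clip(round(x / s), -q_max, q_max); no zero-point to add
    return np.clip(np.round(x / s), -QMAX, QMAX).astype(np.int8)

def dequantize_sym(x_q, s):
    # x_hat = s * x_q; no zero-point to subtract
    return s * x_q.astype(np.float32)

x = np.array([-1.0, -0.25, 0.0, 0.5, 1.0], dtype=np.float32)
s = symmetric_scale(x)
x_hat = dequantize_sym(quantize_sym(x, s), s)
```

The real value zero still maps exactly (it lands on $x_q = 0$), and both quantize and de-quantize need one fewer integer addition per element than the affine scheme.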

NOTE: The most widely used quantization algorithm is symmetric quantization, as affine quantization does not offer a significant boost in model performance. NVIDIA TensorRT and Model Optimizer use symmetric quantization.