Quantization: Reducing Numerical Precision in Neural Network Weights and Activations
LLM.int8() (Dettmers et al., NeurIPS 2022): mixed-precision INT8 that keeps the ~0.1% of outlier features in FP16, enabling 8-bit inference with no accuracy degradation; GPTQ (Frantar et al., ICLR 2023): Hessian-compensated INT4 achieving a <1% perplexity increase on 175B-scale models.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| INT8 vs FP32 memory reduction | 4× | compression ratio | FP32 = 32 bits per weight; INT8 = 8 bits; 4× fewer bits; 7B model: 28 GB → 7 GB |
| INT4 vs FP32 memory reduction | 8× | compression ratio | INT4 = 4 bits; 8× compression; GPTQ achieves this with <1% perplexity degradation at large scale |
| LLM.int8() outlier feature fraction | ~0.1% | % of activation dimensions | Dettmers et al.: ~0.1% of features cause activation magnitudes >6σ; kept in FP16 precision |
| GPTQ INT4 perplexity increase | <1% | relative perplexity increase | Frantar et al. (2022): 175B model GPTQ INT4 shows minimal perplexity degradation vs FP16 |
| INT8 inference throughput gain | 1.5–2× | throughput multiplier | Practical speedup vs FP16 on GPUs with INT8 tensor cores (A100, H100); memory-bandwidth bottleneck reduced |
Quantization reduces the numerical precision of neural network weights and activations from floating-point formats (FP32/FP16/BF16) to lower-precision integers (INT8 or INT4). The primary goals are reducing memory footprint and increasing inference throughput, enabling deployment of large language models on hardware with limited memory bandwidth.
Precision Formats and Memory
| Format | Bits | Range | 7B Model Memory |
|---|---|---|---|
| FP32 | 32 | ±3.4×10³⁸ | 28 GB |
| FP16 / BF16 | 16 | ±65,504 (FP16); ±3.4×10³⁸ (BF16) | 14 GB |
| INT8 | 8 | −128 to 127 | 7 GB |
| INT4 | 4 | −8 to 7 | 3.5 GB |
| INT2 | 2 | −2 to 1 | 1.75 GB (unusable quality) |
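The memory column follows directly from bits per weight. A minimal sketch of the arithmetic behind the table (using the decimal 10⁹-bytes-per-GB convention the table implies):

```python
# Weight memory of a 7B-parameter model at different precisions.
PARAMS = 7e9

def model_memory_gb(bits_per_weight: float) -> float:
    """Weight memory in GB (decimal, 1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_memory_gb(bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```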
Quantization Methods
Round-to-Nearest (RTN)
Simplest approach: round each weight w to nearest quantization level. Works adequately for INT8; significant accuracy loss at INT4 for models below ~30B parameters.
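RTN can be sketched in a few lines. This is a per-tensor symmetric variant for illustration; production kernels typically use per-channel or per-group scales:

```python
import numpy as np

def rtn_quantize_int8(w: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization of one weight tensor.

    Returns INT8 codes plus the floating-point scale needed to
    dequantize. Minimal sketch with a single per-tensor scale.
    """
    scale = np.abs(w).max() / 127.0          # map largest |w| to 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = rtn_quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()     # bounded by scale / 2
```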
GPTQ (Frantar et al., 2022)
Layer-wise Hessian-compensated quantization:
- Compute approximate Hessian H of layer output w.r.t. weights
- Quantize weights column by column; after rounding weight w_i to w_q, update the remaining weights to compensate: Δw_{−i} = −(w_q − w_i) / [H⁻¹]_{ii} · [H⁻¹]_{i,−i}
- Achieves INT4 with <1% perplexity increase at 175B scale
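The steps above can be sketched as a simplified inner loop. This omits GPTQ's Cholesky decomposition, blocking, and group scales, and assumes a fixed quantization step `scale` and a precomputed inverse Hessian `H_inv`:

```python
import numpy as np

def gptq_row_sketch(w: np.ndarray, H_inv: np.ndarray, scale: float):
    """Quantize one weight row column-by-column with error compensation.

    Simplified sketch of the GPTQ inner loop: after rounding w[i], the
    remaining weights are nudged to cancel the induced output error
    using the inverse Hessian H_inv.
    """
    w = w.astype(np.float64).copy()
    q = np.zeros_like(w)
    n = w.shape[0]
    for i in range(n):
        q[i] = np.round(w[i] / scale) * scale       # round-to-nearest
        e = (q[i] - w[i]) / H_inv[i, i]             # normalized error
        w[i + 1:] -= e * H_inv[i, i + 1:]           # compensate the rest
    return q
```

With an identity inverse Hessian the compensation term vanishes and the loop reduces to plain RTN, which makes the role of the off-diagonal Hessian entries explicit.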
AWQ — Activation-Aware Quantization (Lin et al., 2023)
Key insight: ~1% of weights are “salient” — they process activation values with large magnitudes. AWQ scales these salient weights before quantization, preserving their effective precision without storing them separately in FP16.
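The scaling trick can be sketched as below. The exponent `alpha` and the normalization are illustrative stand-ins for the grid search AWQ performs; the key property is that scaling weights up and activations down by the same per-channel factor leaves the matmul output unchanged in exact arithmetic:

```python
import numpy as np

def awq_scale_sketch(w: np.ndarray, act_mag: np.ndarray, alpha: float = 0.5):
    """Activation-aware per-input-channel scaling before quantization.

    act_mag: average activation magnitude per input channel (must be
    positive). Salient channels (large activations) get their weights
    scaled up before rounding, shrinking their relative quantization
    error; the inverse scale is folded into the activation path so
    y = (x / s) @ (s * w) equals x @ w exactly.
    """
    s = act_mag ** alpha                       # per-channel scale
    s = s / (s.max() * s.min()) ** 0.5         # normalize around 1
    w_scaled = w * s[:, None]                  # scale rows (input dims)
    # ...then quantize w_scaled with any INT4/INT8 routine and apply
    # 1/s to the incoming activations (or fold it into the prior layer).
    return w_scaled, s
```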
LLM.int8(): Mixed-Precision Decomposition
Dettmers et al. (2022) discovered that a small fraction of activation dimensions produce extreme outlier values (>6σ), which cannot be accurately represented in INT8.
| Component | Precision | Notes |
|---|---|---|
| ~99.9% of feature dimensions | INT8 | Standard vector-wise quantization |
| ~0.1% outlier feature dimensions | FP16 | Activation magnitudes exceed INT8 range |
| Matrix multiply output | FP16 | Accumulated from INT8 + FP16 streams |
This mixed-precision scheme achieves no accuracy degradation at 8-bit while enabling inference on GPUs with 4× less memory than FP32.
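The decomposition can be sketched end to end. This simplified version uses an absolute outlier threshold (the paper uses ~6σ of the feature distribution) and FP32 in place of FP16 for the high-precision path:

```python
import numpy as np

def int8_mixed_matmul(x: np.ndarray, w: np.ndarray, outlier_threshold: float = 6.0):
    """LLM.int8()-style mixed-precision matmul, minimal sketch.

    Feature columns of x exceeding the threshold run in high precision
    (FP16 in the paper; FP32 here); the remaining ~99.9% are quantized
    to INT8 with per-row and per-column scales, multiplied in integer
    arithmetic, and accumulated back into the output.
    """
    # 1. Identify outlier feature dimensions
    outliers = np.abs(x).max(axis=0) > outlier_threshold
    # 2. High-precision path for outlier features
    y_fp = x[:, outliers].astype(np.float32) @ w[outliers, :].astype(np.float32)
    # 3. INT8 path for everything else (vector-wise scales)
    xs, ws = x[:, ~outliers], w[~outliers, :]
    sx = np.abs(xs).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per row
    sw = np.abs(ws).max(axis=0, keepdims=True) / 127.0 + 1e-12  # per column
    xq = np.round(xs / sx).astype(np.int8)
    wq = np.round(ws / sw).astype(np.int8)
    y_int = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)
    return y_fp + y_int
```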
KV Cache Quantization
The KV cache can also be quantized: an INT8 KV cache halves the memory required for cached attention states, allowing longer sequences or larger batch sizes within the same GPU memory budget.
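The savings are straightforward to compute. A sketch with assumed LLaMA-7B-like shapes (32 layers, hidden size 4096) and a 4096-token sequence:

```python
# KV cache memory for one sequence at FP16 vs INT8.
# Assumed model shape: 32 layers, hidden size 4096 (illustrative).
layers, hidden, seq_len = 32, 4096, 4096

def kv_cache_gb(bytes_per_elem: int) -> float:
    """K and V each store (seq_len, hidden) values per layer."""
    return 2 * layers * seq_len * hidden * bytes_per_elem / 1e9

fp16 = kv_cache_gb(2)   # ~2.15 GB per 4096-token sequence
int8 = kv_cache_gb(1)   # exactly half
```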
Related Pages
See inference-vs-training-compute for broader inference cost context, knowledge-distillation for a complementary compression approach, and kv-cache for the attention memory that quantization also targets.
Sources
- Dettmers et al. (2022) — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022
- Frantar et al. (2022) — GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023
- Lin et al. (2023) — AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024
Frequently Asked Questions
Why does quantization work without catastrophic accuracy loss?
Neural networks are highly over-parameterized — weight values are clustered and individually encode little information. Rounding each weight to the nearest of 256 levels (INT8) or 16 levels (INT4) introduces quantization error, but this error is distributed across all weights and is often smaller than training noise. Large models (>7B parameters) are empirically more robust to quantization than small models: the error per parameter is diluted across more redundant representations. GPTQ further compensates by updating remaining weights when each weight is quantized.
What are the two main post-training quantization approaches?
Round-to-nearest (RTN) rounds each weight to the nearest quantization level — fast but loses significant accuracy at INT4 for smaller models. GPTQ (Frantar et al., 2022) uses approximate second-order Hessian information: when one weight is rounded, other weights in the same layer are updated to compensate for the output error introduced. This layer-wise compensation achieves INT4 quality close to FP16, whereas RTN degrades noticeably at 4 bits. AWQ (Lin et al., 2023) identifies ~1% of 'salient' weights and scales them before quantization to reduce their error.