Quantization: Reducing Numerical Precision in Neural Network Weights and Activations

Category: inference · Updated: 2026-02-27

LLM.int8() (Dettmers et al., NeurIPS 2022): mixed-precision INT8 with FP16 for the ~0.1% outlier features enables 8-bit inference with no accuracy degradation. GPTQ (Frantar et al., ICLR 2023): Hessian-compensated INT4 quantization achieves a <1% perplexity increase on 175B-scale models.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| INT8 vs FP32 memory reduction | 4× | compression ratio | FP32 = 32 bits per weight; INT8 = 8 bits; 7B model: 28 GB → 7 GB |
| INT4 vs FP32 memory reduction | 8× | compression ratio | INT4 = 4 bits; GPTQ achieves this with <1% perplexity degradation at large scale |
| LLM.int8() outlier feature fraction | ~0.1% | % of activation dimensions | Dettmers et al.: ~0.1% of features cause activation magnitudes >6σ; kept in FP16 precision |
| GPTQ INT4 perplexity increase | <1% | relative perplexity increase | Frantar et al. (2022): 175B model GPTQ INT4 shows minimal perplexity degradation vs FP16 |
| INT8 inference throughput gain | 1.5–2× | throughput multiplier | Practical speedup vs FP16 on GPUs with INT8 tensor cores (A100, H100); memory-bandwidth bottleneck reduced |

Quantization reduces the numerical precision of neural network weights and activations from floating-point formats (FP32/FP16/BF16) to lower-precision integers (INT8 or INT4). The primary goals are reducing memory footprint and increasing inference throughput, enabling deployment of large language models on hardware with limited memory bandwidth.

Precision Formats and Memory

| Format | Bits | Range | 7B Model Memory |
|---|---|---|---|
| FP32 | 32 | ±3.4×10³⁸ | 28 GB |
| FP16 / BF16 | 16 | ±65,504 (FP16); BF16 matches FP32 range | 14 GB |
| INT8 | 8 | −128 to 127 | 7 GB |
| INT4 | 4 | −8 to 7 | 3.5 GB |
| INT2 | 2 | −2 to 1 | 1.75 GB (unusable quality) |
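
The memory column follows directly from parameter count × bits per weight. A minimal sketch (the function name is illustrative):

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes): params × bits ÷ 8."""
    return n_params * bits_per_weight / 8 / 1e9

# The 7B-parameter figures from the table above
for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit: {model_memory_gb(7e9, bits):.2f} GB")
# prints 28.00, 14.00, 7.00, 3.50, 1.75 GB
```

Note this counts weights only; activations, KV cache, and optimizer state (for training) add on top.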

Quantization Methods

Round-to-Nearest (RTN)

Simplest approach: round each weight w to the nearest quantization level. Works adequately for INT8; significant accuracy loss at INT4 for models below ~30B parameters.
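
A per-tensor symmetric RTN sketch in NumPy (illustrative, not any specific library's API):

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 8):
    """Symmetric round-to-nearest: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

def rtn_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, s = rtn_quantize(w, bits=8)
# Worst-case per-weight error is half a quantization step
assert np.abs(rtn_dequantize(q, s) - w).max() <= s / 2 + 1e-6
```

The per-tensor scale is the weak point: a single large weight inflates the step size for every other weight, which is exactly what GPTQ and AWQ work around.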

GPTQ (Frantar et al., 2022)

Layer-wise Hessian-compensated quantization:

  1. Compute an approximate Hessian H = 2XXᵀ of the layer's reconstruction error w.r.t. the weights, from calibration activations X
  2. Quantize weights column by column; after rounding weight w_i to w_q, update the remaining weights: Δw_{−i} = −(w_q − w_i) / [H⁻¹]_{ii} · [H⁻¹]_{i,−i}
  3. Achieves INT4 with <1% perplexity increase at 175B scale
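
A toy single-row sketch of the compensation step (the real algorithm processes whole weight matrices in blocks and gets H⁻¹ from a Cholesky factorization; names and the damping constant here are illustrative):

```python
import numpy as np

def gptq_quantize_row(w, H_inv, bits=4):
    """Quantize one weight row left-to-right; after each rounding,
    spread the error onto not-yet-quantized weights via the inverse Hessian."""
    w = w.astype(np.float64).copy()
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.zeros_like(w)
    for i in range(len(w)):
        q[i] = np.clip(np.round(w[i] / scale), -qmax - 1, qmax)
        err = (w[i] - q[i] * scale) / H_inv[i, i]
        w[i + 1:] -= err * H_inv[i, i + 1:]   # the Δw update from step 2
    return q * scale, scale

# Hessian from calibration activations X (damped for numerical stability)
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 512))            # (in_features, n_samples)
H = 2 * X @ X.T / X.shape[1] + 1e-2 * np.eye(64)
H_inv = np.linalg.inv(H)
w = rng.standard_normal(64)
w_deq, s = gptq_quantize_row(w, H_inv, bits=4)
```

The key difference from RTN is the inner update: each rounding error is absorbed by the still-unquantized weights in proportion to their correlation under H⁻¹, so the layer's output error stays small even at 4 bits.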

AWQ — Activation-Aware Quantization (Lin et al., 2023)

Key insight: ~1% of weights are “salient” — they process activation values with large magnitudes. AWQ scales these salient weights before quantization, preserving their effective precision without storing them separately in FP16.
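
A simplified sketch of the scaling trick (per-tensor grid and a fixed exponent; the real method grid-searches the scaling exponent per layer and quantizes in groups):

```python
import numpy as np

def awq_quantize(W, act_mag, alpha=0.5, bits=4):
    """W: (in_features, out_features); act_mag: mean |activation| per input channel.
    Scale salient input channels up before rounding, then fold the scale back,
    so their effective rounding error shrinks by 1/s."""
    s = act_mag ** alpha                      # per-channel scale s_j = mag_j^alpha
    Ws = W * s[:, None]
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(Ws).max() / qmax
    Wq = np.round(Ws / step) * step           # RTN on the scaled weights
    return Wq / s[:, None]                    # effective dequantized weights

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
act_mag = np.ones(128); act_mag[0] = 16.0    # channel 0 is "salient"
W_deq = awq_quantize(W, act_mag)
```

Mathematically y = xW ≈ (x/s) · Q(W·diag(s)); at inference the 1/s factor is folded into the preceding operation, so no FP16 weights need to be stored.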

LLM.int8(): Mixed-Precision Decomposition

Dettmers et al. (2022) discovered that a small fraction of activation dimensions produce extreme outlier values (>6σ), which cannot be accurately represented in INT8.

| Component | Precision | Notes |
|---|---|---|
| ~99.9% of weight dimensions | INT8 | Standard quantization |
| ~0.1% outlier feature dimensions | FP16 | Activation magnitudes exceed INT8 range |
| Matrix multiply output | FP16 | Accumulated from INT8 + FP16 streams |

This mixed-precision scheme achieves no accuracy degradation at 8 bits while enabling inference in one-quarter of the memory FP32 requires.
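
The decomposition can be sketched as follows (a simplification of the paper's vector-wise scheme; the threshold and shapes are illustrative):

```python
import numpy as np

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """X: (batch, features), W: (features, out). Outlier feature dimensions
    stay floating point; the rest use INT8 with row/column-wise scales."""
    outlier = np.abs(X).max(axis=0) > threshold
    y_fp = X[:, outlier] @ W[outlier, :]                 # FP path (~0.1% of dims)
    Xr, Wr = X[:, ~outlier], W[~outlier, :]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127     # per-row scale for X
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127     # per-column scale for W
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    y_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw
    return y_int8 + y_fp

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 512)); X[:, 7] *= 30.0       # one outlier dimension
W = rng.standard_normal((512, 64))
y = int8_matmul_with_outliers(X, W)
rel_err = np.linalg.norm(y - X @ W) / np.linalg.norm(X @ W)
```

On real transformer activations the FP path carries the >6σ outlier dimensions that would otherwise saturate the INT8 grid and corrupt every other feature sharing the same scale.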

KV Cache Quantization

The KV cache can also be quantized: an INT8 KV cache halves the memory required for cached attention states relative to FP16, allowing longer sequences or larger batch sizes within the same GPU memory budget.
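
A per-token INT8 sketch of the idea (illustrative shapes; production kernels typically quantize per head or per group):

```python
import numpy as np

def quantize_kv(kv):
    """Per-token (per-row) symmetric INT8 quantization of cached K/V states."""
    scale = np.maximum(np.abs(kv).max(axis=-1, keepdims=True), 1e-8) / 127
    return np.round(kv / scale).astype(np.int8), scale.astype(np.float32)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

# FP16 cache for one attention head: (seq_len, head_dim) = 1024 × 128 → 256 KiB
k_cache = np.random.randn(1024, 128).astype(np.float16)
kq, ks = quantize_kv(k_cache.astype(np.float32))
# INT8 payload is 128 KiB: the cache memory halves (plus one scale per token)
```

Keys and values are quantized as they are appended and dequantized on read inside the attention kernel, so the savings apply to the dominant long-context memory cost.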

See inference-vs-training-compute for broader inference cost context, knowledge-distillation for a complementary compression approach, and kv-cache for the attention memory that quantization also targets.


Frequently Asked Questions

Why does quantization work without catastrophic accuracy loss?

Neural networks are highly over-parameterized — weight values are clustered and individually encode little information. Rounding each weight to the nearest of 256 levels (INT8) or 16 levels (INT4) introduces quantization error, but this error is distributed across all weights and is often smaller than training noise. Large models (>7B parameters) are empirically more robust to quantization than small models: the error per parameter is diluted across more redundant representations. GPTQ further compensates by updating remaining weights when each weight is quantized.
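
The dilution argument can be made concrete by measuring the signal-to-noise ratio of rounding a Gaussian weight vector (an illustrative experiment, not a result from the cited papers):

```python
import numpy as np

def quant_snr_db(w, bits):
    """Signal-to-noise ratio of symmetric round-to-nearest at a given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    noise = np.round(w / scale) * scale - w
    return 10 * np.log10(np.mean(w ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)
print(f"INT8 SNR: {quant_snr_db(w, 8):.1f} dB")   # plenty of headroom
print(f"INT4 SNR: {quant_snr_db(w, 4):.1f} dB")   # much closer to the signal
```

Each bit removed costs roughly 6 dB of SNR, which is why plain rounding holds up at 8 bits but needs compensation tricks like GPTQ or AWQ at 4.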

What are the two main post-training quantization approaches?

Round-to-nearest (RTN) rounds each weight to the nearest quantization level — fast but loses significant accuracy at INT4 for smaller models. GPTQ (Frantar et al., 2022) uses approximate second-order Hessian information: when one weight is rounded, other weights in the same layer are updated to compensate for the output error introduced. This layer-wise compensation achieves INT4 quality close to FP16, whereas RTN degrades noticeably at 4 bits. AWQ (Lin et al., 2023) identifies ~1% of 'salient' weights and scales them before quantization to reduce their error.
