Quantization: Reducing Numerical Precision in Neural Network Weights and Activations
LLM.int8() (Dettmers et al., NeurIPS 2022): mixed-precision INT8 that keeps the ~0.1% of outlier features in FP16, enabling 8-bit inference with no accuracy degradation; GPTQ (Frantar et al., ICLR 2023): Hessian-compensated INT4 achieving a <1% perplexity increase on 175B-scale models.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| INT8 vs FP32 memory reduction | 4× | compression ratio | FP32 = 32 bits per weight; INT8 = 8 bits; 4× fewer bits; 7B model: 28 GB → 7 GB |
| INT4 vs FP32 memory reduction | 8× | compression ratio | INT4 = 4 bits; 8× compression; GPTQ achieves this with <1% perplexity degradation at large scale |
| LLM.int8() outlier feature fraction | ~0.1% | % of activation dimensions | Dettmers et al.: ~0.1% of features cause activation magnitudes >6σ; kept in FP16 precision |
| GPTQ INT4 perplexity increase | <1% | relative perplexity increase | Frantar et al. (2022): 175B model GPTQ INT4 shows minimal perplexity degradation vs FP16 |
| INT8 inference throughput gain | 1.5–2× | throughput multiplier | Practical speedup vs FP16 on GPUs with INT8 tensor cores (A100, H100); memory-bandwidth bottleneck reduced |
Quantization reduces the numerical precision of neural network weights and activations from floating-point formats (FP32/FP16/BF16) to lower-precision integers (INT8 or INT4). The primary goals are reducing memory footprint and increasing inference throughput, enabling deployment of large language models on hardware with limited memory bandwidth.
Precision Formats and Memory
| Format | Bits | Range | 7B Model Memory |
|---|---|---|---|
| FP32 | 32 | ±3.4×10³⁸ | 28 GB |
| FP16 / BF16 | 16 | ±65,504 (FP16); ±3.4×10³⁸ (BF16) | 14 GB |
| INT8 | 8 | −128 to 127 | 7 GB |
| INT4 | 4 | −8 to 7 | 3.5 GB |
| INT2 | 2 | −2 to 1 | 1.75 GB (unusable quality) |
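The memory column follows directly from bits per weight. A minimal sketch of the arithmetic behind the table (using the decimal 10⁹-bytes-per-GB convention the table implies):

```python
# Weight memory of a 7B-parameter model at different precisions.
PARAMS = 7e9

def model_memory_gb(bits_per_weight: float) -> float:
    """Weight memory in GB (decimal, 1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_memory_gb(bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```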
Quantization Methods
Round-to-Nearest (RTN)
Simplest approach: round each weight w to nearest quantization level. Works adequately for INT8; significant accuracy loss at INT4 for models below ~30B parameters.
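RTN can be sketched in a few lines. This is a per-tensor symmetric variant for illustration; production kernels typically use per-channel or per-group scales:

```python
import numpy as np

def rtn_quantize_int8(w: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization of one weight tensor.

    Returns INT8 codes plus the floating-point scale needed to
    dequantize. Minimal sketch with a single per-tensor scale.
    """
    scale = np.abs(w).max() / 127.0          # map largest |w| to 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = rtn_quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()     # bounded by scale / 2
```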
GPTQ (Frantar et al., 2022)
Layer-wise Hessian-compensated quantization:
- Compute approximate Hessian H of layer output w.r.t. weights
- Quantize weights column by column; after rounding weight w_i to w_q, update the remaining weights to compensate: Δw_{−i} = −(w_q − w_i) / [H⁻¹]_{ii} · [H⁻¹]_{i,−i}
- Achieves INT4 with <1% perplexity increase at 175B scale
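The steps above can be sketched as a simplified inner loop. This omits GPTQ's Cholesky decomposition, blocking, and group scales, and assumes a fixed quantization step `scale` and a precomputed inverse Hessian `H_inv`:

```python
import numpy as np

def gptq_row_sketch(w: np.ndarray, H_inv: np.ndarray, scale: float):
    """Quantize one weight row column-by-column with error compensation.

    Simplified sketch of the GPTQ inner loop: after rounding w[i], the
    remaining weights are nudged to cancel the induced output error
    using the inverse Hessian H_inv.
    """
    w = w.astype(np.float64).copy()
    q = np.zeros_like(w)
    n = w.shape[0]
    for i in range(n):
        q[i] = np.round(w[i] / scale) * scale       # round-to-nearest
        e = (q[i] - w[i]) / H_inv[i, i]             # normalized error
        w[i + 1:] -= e * H_inv[i, i + 1:]           # compensate the rest
    return q
```

With an identity inverse Hessian the compensation term vanishes and the loop reduces to plain RTN, which makes the role of the off-diagonal Hessian entries explicit.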
AWQ — Activation-Aware Quantization (Lin et al., 2023)
Key insight: ~1% of weights are “salient” — they process activation values with large magnitudes. AWQ scales these salient weights before quantization, preserving their effective precision without storing them separately in FP16.
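The scaling trick can be sketched as below. The exponent `alpha` and the normalization are illustrative stand-ins for the grid search AWQ performs; the key property is that scaling weights up and activations down by the same per-channel factor leaves the matmul output unchanged in exact arithmetic:

```python
import numpy as np

def awq_scale_sketch(w: np.ndarray, act_mag: np.ndarray, alpha: float = 0.5):
    """Activation-aware per-input-channel scaling before quantization.

    act_mag: average activation magnitude per input channel (must be
    positive). Salient channels (large activations) get their weights
    scaled up before rounding, shrinking their relative quantization
    error; the inverse scale is folded into the activation path so
    y = (x / s) @ (s * w) equals x @ w exactly.
    """
    s = act_mag ** alpha                       # per-channel scale
    s = s / (s.max() * s.min()) ** 0.5         # normalize around 1
    w_scaled = w * s[:, None]                  # scale rows (input dims)
    # ...then quantize w_scaled with any INT4/INT8 routine and apply
    # 1/s to the incoming activations (or fold it into the prior layer).
    return w_scaled, s
```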
LLM.int8(): Mixed-Precision Decomposition
Dettmers et al. (2022) discovered that a small fraction of activation dimensions produce extreme outlier values (>6σ), which cannot be accurately represented in INT8.
| Component | Precision | Notes |
|---|---|---|
| ~99.9% of feature dimensions | INT8 | Standard vector-wise quantization |
| ~0.1% outlier feature dimensions | FP16 | Activation magnitudes exceed INT8 range |
| Matrix multiply output | FP16 | Accumulated from INT8 + FP16 streams |
This mixed-precision scheme achieves no accuracy degradation at 8-bit while enabling inference on GPUs with 4× less memory than FP32.
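The decomposition can be sketched end to end. This simplified version uses an absolute outlier threshold (the paper uses ~6σ of the feature distribution) and FP32 in place of FP16 for the high-precision path:

```python
import numpy as np

def int8_mixed_matmul(x: np.ndarray, w: np.ndarray, outlier_threshold: float = 6.0):
    """LLM.int8()-style mixed-precision matmul, minimal sketch.

    Feature columns of x exceeding the threshold run in high precision
    (FP16 in the paper; FP32 here); the remaining ~99.9% are quantized
    to INT8 with per-row and per-column scales, multiplied in integer
    arithmetic, and accumulated back into the output.
    """
    # 1. Identify outlier feature dimensions
    outliers = np.abs(x).max(axis=0) > outlier_threshold
    # 2. High-precision path for outlier features
    y_fp = x[:, outliers].astype(np.float32) @ w[outliers, :].astype(np.float32)
    # 3. INT8 path for everything else (vector-wise scales)
    xs, ws = x[:, ~outliers], w[~outliers, :]
    sx = np.abs(xs).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per row
    sw = np.abs(ws).max(axis=0, keepdims=True) / 127.0 + 1e-12  # per column
    xq = np.round(xs / sx).astype(np.int8)
    wq = np.round(ws / sw).astype(np.int8)
    y_int = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)
    return y_fp + y_int
```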
KV Cache Quantization
The KV cache can also be quantized: an INT8 KV cache halves the memory required for cached attention states, allowing longer sequences or larger batch sizes within the same GPU memory budget.
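The savings are straightforward to compute. A sketch with assumed LLaMA-7B-like shapes (32 layers, hidden size 4096) and a 4096-token sequence:

```python
# KV cache memory for one sequence at FP16 vs INT8.
# Assumed model shape: 32 layers, hidden size 4096 (illustrative).
layers, hidden, seq_len = 32, 4096, 4096

def kv_cache_gb(bytes_per_elem: int) -> float:
    """K and V each store (seq_len, hidden) values per layer."""
    return 2 * layers * seq_len * hidden * bytes_per_elem / 1e9

fp16 = kv_cache_gb(2)   # ~2.15 GB per 4096-token sequence
int8 = kv_cache_gb(1)   # exactly half
```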
Related Pages
See inference-vs-training-compute for broader inference cost context, knowledge-distillation for a complementary compression approach, and kv-cache for the attention memory that quantization also targets.
Sources
- Dettmers et al. (2022) — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022
- Frantar et al. (2022) — GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023
- Lin et al. (2023) — AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024
Frequently Asked Questions
Why does quantization work without catastrophic accuracy loss?
Neural networks are highly over-parameterized — weight values are clustered and individually encode little information. Rounding each weight to the nearest of 256 levels (INT8) or 16 levels (INT4) introduces quantization error, but this error is distributed across all weights and is often smaller than training noise. Large models (>7B parameters) are empirically more robust to quantization than small models: the error per parameter is diluted across more redundant representations. GPTQ further compensates by updating remaining weights when each weight is quantized.
What are the two main post-training quantization approaches?
Round-to-nearest (RTN) rounds each weight to the nearest quantization level — fast but loses significant accuracy at INT4 for smaller models. GPTQ (Frantar et al., 2022) uses approximate second-order Hessian information: when one weight is rounded, other weights in the same layer are updated to compensate for the output error introduced. This layer-wise compensation achieves INT4 quality close to FP16, whereas RTN degrades noticeably at 4 bits. AWQ (Lin et al., 2023) identifies ~1% of 'salient' weights and scales them before quantization to reduce their error.