Chain-of-Thought Prompting: Intermediate Reasoning Steps Improve Multi-Step Accuracy

Category: agents-applications Updated: 2026-02-27

Wei et al. (NeurIPS 2022): adding step-by-step reasoning to 8-shot exemplars raised PaLM 540B GSM8K accuracy from 18% to 57%; Kojima et al. (2022): the zero-shot CoT trigger "Let's think step by step" raised MultiArith from 17.7% to 78.7%; self-consistency (Wang et al., 2022) adds another 17 percentage points via majority vote.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GSM8K: standard vs CoT (PaLM 540B) | 18% → 57% | % accuracy | Wei et al. (2022): 8-shot standard prompting vs 8-shot chain-of-thought; +39 percentage points |
| MultiArith: zero-shot CoT (540B) | 17.7% → 78.7% | % accuracy | Kojima et al. (2022): zero-shot standard vs "Let's think step by step"; +61 percentage points |
| Self-consistency gain on GSM8K | 57% → 74% (k=40 samples) | % accuracy | Wang et al. (2022): majority vote over 40 CoT samples; PaLM 540B; +17 percentage points |
| Scale threshold for CoT benefit | ~100B | parameters | Wei et al. (2022): CoT benefits emerge reliably only above ~100B parameters; smaller models show no gain or regression |

Chain-of-thought (CoT) prompting augments few-shot examples with intermediate reasoning steps before the final answer. Rather than (question → answer) exemplars, CoT provides (question → step-by-step reasoning → answer) exemplars, causing the model to generate its own reasoning trace when answering new questions.
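The exemplar structure can be sketched as a small prompt builder. This is a minimal illustration, not the exact Wei et al. prompt set; `build_cot_prompt` and the single exemplar shown are assumptions for demonstration:

```python
# Sketch of assembling a few-shot CoT prompt: each exemplar carries an
# intermediate reasoning trace before its final answer. The real 8-shot
# set contains eight such (question, reasoning, answer) triples.
COT_EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        "reasoning": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
    # ... seven more exemplars for the full 8-shot prompt
]

def build_cot_prompt(exemplars, question):
    """Format (question -> reasoning -> answer) exemplars, then the new question."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in exemplars
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

The model then continues from the trailing `A:`, imitating the reasoning-before-answer pattern of the exemplars.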

Original Wei et al. (2022) Results

Wei et al. evaluated 8-shot CoT prompting with explicit reasoning steps across arithmetic, commonsense, and symbolic reasoning tasks using PaLM 540B.

| Dataset | Standard 8-shot | CoT 8-shot | Gain |
|---|---|---|---|
| GSM8K (math) | 18.0% | 57.0% | +39 pts |
| MAWPS (math) | 73.0% | 93.0% | +20 pts |
| StrategyQA (commonsense) | 82.0% | 84.0% | +2 pts |
| Letter Concatenation | 67.0% | 93.0% | +26 pts |

Zero-Shot CoT: Kojima et al. (2022)

The zero-shot variant appends “Let’s think step by step” to the prompt with no exemplars at all:

Standard: “Q: [question] A:”
Zero-shot CoT: “Q: [question] A: Let’s think step by step.”
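Kojima et al.'s pipeline is two-stage: the trigger phrase first elicits a reasoning chain, then a second call extracts the final answer. A minimal sketch, where `generate` stands in for any LLM completion function (an assumption, not a specific API):

```python
# Two-stage zero-shot CoT (Kojima et al., 2022). `generate` is a placeholder
# for an LLM completion call (prompt string -> completion string).
def zero_shot_cot(question, generate):
    # Stage 1: elicit the reasoning chain with the trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)
    # Stage 2: append the chain and ask explicitly for the final answer.
    extraction_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(extraction_prompt).strip()
```

No exemplars are needed; the trigger alone shifts the model into generating a reasoning trace before committing to an answer.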

| Dataset | Zero-shot | Zero-shot CoT | Gain |
|---|---|---|---|
| MultiArith | 17.7% | 78.7% | +61 pts |
| GSM8K | 10.4% | 40.7% | +30 pts |
| AddSub | 69.6% | 74.7% | +5 pts |
| AQuA-RAT | 22.4% | 33.5% | +11 pts |

Why CoT Works: The Scratchpad Mechanism

Chain-of-thought converts a single multi-step prediction into a sequence of simpler next-token predictions, where each step conditions on previous reasoning steps. The generated text serves as “working memory”: the model stores intermediate results in the output stream rather than relying on internal representations to hold them across many attention layers.

Self-Consistency Sampling (Wang et al., 2022)

Generate k CoT paths independently; take majority vote on final answers:

| Paths sampled (k) | GSM8K accuracy (PaLM 540B) | Compute cost |
|---|---|---|
| 1 | 57.0% | 1× |
| 10 | 66.9% | 10× |
| 40 | 74.4% | 40× |

Self-consistency trades inference compute for accuracy — highly effective when the inference budget allows multiple samples.
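The sample-and-vote loop can be sketched as follows. Here `sample` stands in for a temperature-sampled LLM call, and the answer parser is deliberately simplistic (real pipelines typically match an explicit "The answer is ..." pattern); both are illustrative assumptions:

```python
import re
from collections import Counter

def extract_final_answer(chain):
    """Take the last number in a reasoning chain as its final answer
    (a simplistic parser used only for this sketch)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None

def self_consistency(prompt, sample, k=40):
    """Sample k CoT chains independently and majority-vote the final answers."""
    answers = [extract_final_answer(sample(prompt)) for _ in range(k)]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

Diversity matters: the chains must be sampled at nonzero temperature, since greedy decoding would produce k identical paths and the vote would be meaningless.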

Scale Dependency

| Model size | CoT benefit on GSM8K |
|---|---|
| ~350M | Negative (CoT hurts vs standard) |
| ~8B | Minimal / negligible |
| ~62B | Small positive |
| ~540B | Large (+39 percentage points) |

See prompt-engineering for broader technique comparisons, emergent-capabilities for why CoT benefits emerge sharply at scale, and tool-use-function-calling for how reasoning traces guide tool selection in agentic settings.


Frequently Asked Questions

Why does chain-of-thought prompting only work at large scale?

Wei et al. (2022) tested CoT across models from ~300M to 540B parameters. Below ~100B parameters, CoT consistently equaled or underperformed standard prompting — models generated plausible-looking but incorrect reasoning chains. Above ~100B parameters, CoT reliably improved accuracy. The explanation: reasoning chains require the model to perform compositional operations (arithmetic, logical deduction) in the generated text. This requires sufficient capacity to both generate coherent language and correctly execute the intermediate computations.

What is self-consistency and how does it improve on basic CoT?

Basic CoT generates one reasoning chain and takes its final answer. Self-consistency (Wang et al., 2022) generates k diverse reasoning paths using temperature sampling and takes a majority vote over the final answers. The intuition: there are multiple valid reasoning paths to a correct answer, but incorrect reasoning produces more varied wrong answers. On GSM8K, self-consistency with k=40 adds 17 percentage points over single-path CoT (74% vs 57%), at the cost of 40× more inference compute.
