Prompt Engineering: Systematic Input Design for Language Model Accuracy and Reliability
Zero-shot CoT (Kojima et al., 2022): MultiArith 17.7% → 78.7%; self-consistency 40 samples (Wang et al., 2022): GSM8K +17%; APE (Zhou et al., ICLR 2023): LLM-generated instructions outperform human-written prompts on 19 of 24 InstructGPT benchmarks.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Zero-shot CoT: MultiArith accuracy | 17.7% → 78.7% | % accuracy | Kojima et al. (2022): text-davinci-002 (InstructGPT); zero-shot 'Let's think step by step' vs standard zero-shot |
| Self-consistency k=40: GSM8K gain | +17 percentage points | % accuracy | Wang et al. (2022): single CoT 57% → self-consistency 74% on GSM8K with PaLM 540B |
| APE outperformance rate | 19 of 24 tasks | tasks improved | Zhou et al. (2022): APE-generated instructions outperform human-written prompts on 19/24 benchmarks |
| Example order accuracy variance | up to ±15% | % accuracy | Same examples in different orderings; calibration or self-consistency reduces variance |
Prompt engineering is the practice of systematically designing input text to improve language model output accuracy, reliability, and consistency. Unlike fine-tuning, it requires no weight updates — performance improvements arise entirely from structuring the information presented to the model at inference time.
Core Techniques
Zero-Shot Prompting
No examples provided. Task defined by instruction alone:
- Basic: “Classify the sentiment: {text}”
- Role: “You are an expert analyst. Classify the sentiment: {text}”
- Zero-shot CoT: “Solve: {problem}. Let’s think step by step.”
Few-Shot Prompting
Provide k labeled examples before the test input. See few-shot-learning for accuracy numbers.
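A minimal sketch of few-shot prompt assembly. The function name, the sentiment instruction, and the example data are illustrative, not from any library; real pipelines vary the separator and label format.

```python
def build_few_shot_prompt(examples, test_input, instruction="Classify the sentiment:"):
    """Format k labeled examples followed by the unlabeled test input."""
    blocks = [instruction]
    for text, label in examples:
        blocks.append(f"Input: {text}\nLabel: {label}")
    # The model completes the final, unlabeled slot.
    blocks.append(f"Input: {test_input}\nLabel:")
    return "\n\n".join(blocks)

examples = [
    ("The movie was wonderful.", "positive"),
    ("I want my money back.", "negative"),
]
prompt = build_few_shot_prompt(examples, "A solid, enjoyable film.")
print(prompt)
```

Keeping the example format identical to the test-input format matters: the model imitates the demonstrated input/label pattern.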
Chain-of-Thought (CoT)
Include intermediate reasoning steps in examples. See chain-of-thought for performance data.
Self-Consistency
Sample k diverse reasoning paths; take majority vote on final answer.
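The voting step can be sketched as below. The canned completions stand in for a real system, which would sample k chain-of-thought completions from an LLM at temperature > 0; the answer-extraction marker is an assumption about the completion format.

```python
from collections import Counter

# Stand-ins for sampled chain-of-thought completions; a real implementation
# would call an LLM k times. The paths mostly, but not always, agree.
SAMPLED_PATHS = [
    "17 - 8 = 9, so the answer is 9",
    "Eating 8 of 17 apples leaves 9, so the answer is 9",
    "17 - 8 = 8, so the answer is 8",  # a faulty reasoning path
]

def extract_answer(completion):
    """Pull the final answer after the 'answer is' marker."""
    return completion.rsplit("answer is", 1)[-1].strip()

def self_consistency(paths):
    """Majority vote over the final answers of the sampled reasoning paths."""
    counts = Counter(extract_answer(p) for p in paths)
    answer, _ = counts.most_common(1)[0]
    return answer

print(self_consistency(SAMPLED_PATHS))  # majority answer: 9
```

The vote is over final answers only, not over the reasoning text, so diverse paths that converge on the same answer reinforce each other.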
Accuracy Impact by Technique (GSM8K, PaLM 540B)
| Technique | GSM8K Accuracy | Source |
|---|---|---|
| Standard 8-shot | 18.0% | Wei et al. (2022) |
| 8-shot CoT | 57.0% | Wei et al. (2022) |
| Zero-shot CoT | 40.7% | Kojima et al. (2022); text-davinci-002, not PaLM |
| 8-shot CoT + self-consistency (k=40) | 74.4% | Wang et al. (2022) |
Prompt Sensitivity Variables
| Variable | Accuracy Effect | Mitigation |
|---|---|---|
| Example order | ±15% variance | Calibration; majority vote over orderings |
| Instruction wording | ±5–20% | Automatic prompt optimization (APE) |
| Number of examples | +2–5% per example up to saturation | Add examples until gains saturate or the context limit is reached |
| Temperature | ±5–10% across runs at nonzero temperature | Self-consistency reduces variance |
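The order-sensitivity mitigation in the table can be sketched as a majority vote over several example orderings. `classify` is a hypothetical stand-in for a model call; here it is a toy function biased by whichever example comes first, which is exactly the failure mode the vote averages out.

```python
from collections import Counter
from itertools import permutations

def vote_over_orderings(examples, test_input, classify, max_orderings=6):
    """Run the same few-shot query under several example orderings and vote."""
    votes = []
    for ordering in list(permutations(examples))[:max_orderings]:
        prompt = "\n".join(f"{x} -> {y}" for x, y in ordering)
        prompt += f"\n{test_input} ->"
        votes.append(classify(prompt))
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Toy classifier whose output depends on which example appears first.
def classify(prompt):
    return "positive" if prompt.startswith("great") else "negative"

examples = [("great", "positive"), ("awful", "negative"), ("fine", "positive")]
print(vote_over_orderings(examples, "good", classify))
```

Only 2 of the 6 orderings trigger the first-example bias, so the vote settles on the answer the classifier gives under most orderings.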
Automatic Prompt Engineering (APE) — Zhou et al. (2022)
APE uses the target LM to generate and score candidate prompt instructions:
- Generate: “Generate instructions that cause a model to produce the following examples: [demonstrations]”
- Score: evaluate each instruction on validation set
- Return: highest-scoring instruction
APE outperformed human prompts on 19/24 InstructGPT tasks, with the largest gains on tasks where the optimal phrasing is non-obvious to humans.
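The generate/score/return loop can be sketched as follows. `generate_candidates` and `score` are hypothetical stand-ins: a real run would prompt the target model to propose instructions from demonstrations and measure each candidate's validation accuracy.

```python
def ape(demonstrations, validation_set, generate_candidates, score, n=8):
    """Generate n candidate instructions, return the highest-scoring one."""
    candidates = generate_candidates(demonstrations, n)
    # Score each instruction on the validation set and keep the argmax.
    return max(candidates, key=lambda instr: score(instr, validation_set))

# Toy stand-ins so the sketch runs end to end.
def generate_candidates(demos, n):
    return ["Answer the question.", "Let's think step by step.", "Reply tersely."][:n]

def score(instruction, val):
    # Pretend accuracy table; a real score() would run the model on val.
    return {"Let's think step by step.": 0.74}.get(instruction, 0.40)

best = ape(None, None, generate_candidates, score)
print(best)
```

The loop treats the instruction string itself as the object being optimized, with validation accuracy as the objective.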
Prompt Engineering vs Instruction Tuning
| Property | Prompt Engineering | Instruction Tuning |
|---|---|---|
| Weight updates | None | Yes (fine-tuning) |
| Generalization | Task-specific prompt | Across many instruction types |
| Data required | 0–32 examples | Thousands of labeled instructions |
| Inference latency | Higher: long prompts consume context on every request | Lower: short prompts suffice after tuning |
See instruction-tuning for training-time alternatives to prompting, chain-of-thought for the highest-impact single technique, and temperature-sampling for the inference parameter that interacts with self-consistency.
Sources
- Wei et al. (2022) — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022
- Kojima et al. (2022) — Large Language Models are Zero-Shot Reasoners. NeurIPS 2022
- Wang et al. (2022) — Self-Consistency Improves Chain of Thought Reasoning. ICLR 2023
- Zhou et al. (2022) — Large Language Models Are Human-Level Prompt Engineers. ICLR 2023
Frequently Asked Questions
What is the most empirically well-supported prompt engineering technique?
Chain-of-thought prompting (Wei et al., 2022) has the strongest evidence: +20–40% accuracy on multi-step arithmetic and reasoning tasks at large scale. Self-consistency (Wang et al., 2022) stacks another +10–17% via majority voting over multiple sampled CoT paths. Zero-shot CoT 'Let's think step by step' (Kojima et al., 2022) delivers large gains with no example data. These three techniques have been replicated across multiple model families and task types.
What is Automatic Prompt Engineer (APE)?
Zhou et al. (2022) proposed using a language model to generate and evaluate candidate instructions: (1) prompt the model with few-shot examples to generate N candidate instruction strings; (2) evaluate each instruction on a validation set; (3) return the highest-scoring instruction. APE outperformed human-written instructions on 19 of 24 InstructGPT benchmarks, suggesting LLMs can optimize prompts better than humans partly because they have seen the distribution of effective instructions during pre-training.