Prompt Engineering: Systematic Input Design for Language Model Accuracy and Reliability

Category: agents-applications · Updated: 2026-02-27

Zero-shot CoT (Kojima et al., 2022): MultiArith 17.7% → 78.7%; self-consistency 40 samples (Wang et al., 2022): GSM8K +17%; APE (Zhou et al., ICLR 2023): LLM-generated instructions outperform human-written prompts on 19 of 24 InstructGPT benchmarks.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Zero-shot CoT: MultiArith accuracy | 17.7% → 78.7% | % accuracy | Kojima et al. (2022): text-davinci-002; zero-shot "Let's think step by step" vs standard zero-shot |
| Self-consistency k=40: GSM8K gain | +17 percentage points | % accuracy | Wang et al. (2022): single CoT 57% → self-consistency 74% on GSM8K with PaLM 540B |
| APE outperformance rate | 19 of 24 tasks | tasks improved | Zhou et al. (2022): APE-generated instructions outperform human-written prompts on 19/24 InstructGPT benchmarks |
| Example order accuracy variance | up to ±15% | % accuracy | Same examples in different orderings; calibration or self-consistency reduces variance |

Prompt engineering is the practice of systematically designing input text to improve language model output accuracy, reliability, and consistency. Unlike fine-tuning, it requires no weight updates — performance improvements arise entirely from structuring the information presented to the model at inference time.

Core Techniques

Zero-Shot Prompting

No examples provided. Task defined by instruction alone:

  • Basic: “Classify the sentiment: {text}”
  • Role: “You are an expert analyst. Classify the sentiment: {text}”
  • Zero-shot CoT: “Solve: {problem}. Let’s think step by step.”
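The three variants above can be sketched as template functions (a minimal illustration; the function names are ours, not from any library):

```python
def basic_prompt(text: str) -> str:
    return f"Classify the sentiment: {text}"

def role_prompt(text: str) -> str:
    # A role preamble prepended to the same task instruction.
    return f"You are an expert analyst. Classify the sentiment: {text}"

def zero_shot_cot_prompt(problem: str) -> str:
    # Appending the trigger phrase elicits step-by-step reasoning
    # before the final answer (Kojima et al., 2022).
    return f"Solve: {problem}. Let's think step by step."

print(zero_shot_cot_prompt("If I have 3 apples and buy 2 more, how many do I have?"))
```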

Few-Shot Prompting

Provide k labeled examples before the test input. See few-shot-learning for accuracy numbers.
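A minimal sketch of assembling a k-shot prompt (the `Input:`/`Output:` labels and blank-line separators are illustrative assumptions; formats vary across papers):

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Concatenate k labeled (input, output) pairs before the test input."""
    blocks = [f"{input_label}: {inp}\n{output_label}: {out}" for inp, out in examples]
    # The prompt ends at the output label so the model completes the answer.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    [("The movie was great!", "positive"), ("Waste of time.", "negative")],
    "Surprisingly good.",
)
print(prompt)
```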

Chain-of-Thought (CoT)

Include intermediate reasoning steps in examples. See chain-of-thought for performance data.

Self-Consistency

Sample k diverse reasoning paths; take majority vote on final answer.
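Given any sampler that returns one CoT completion per call at temperature > 0, the vote can be sketched as below (the answer extractor and the stubbed sampler are toy assumptions; real implementations parse a structured final answer from each completion):

```python
import itertools
import re
from collections import Counter

def extract_final_answer(completion: str) -> str:
    # Toy extractor: take the last number mentioned in the completion.
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return nums[-1] if nums else completion.strip()

def self_consistent_answer(sample_fn, prompt: str, k: int = 40) -> str:
    """sample_fn(prompt) -> one CoT completion sampled at temperature > 0.
    Majority vote over the k extracted final answers (Wang et al., 2022)."""
    answers = [extract_final_answer(sample_fn(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Stub model: 3 of 5 sampled reasoning paths reach 42, two make errors.
fake = itertools.cycle(["... so the answer is 42", "answer: 41",
                        "therefore 42", "I get 40", "it is 42"])
print(self_consistent_answer(lambda p: next(fake), "Q", k=5))  # → 42
```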

Accuracy Impact by Technique (GSM8K, PaLM 540B)

| Technique | GSM8K Accuracy | Source |
| --- | --- | --- |
| Standard 8-shot | 18.0% | Wei et al. (2022) |
| 8-shot CoT | 57.0% | Wei et al. (2022) |
| Zero-shot CoT | 40.7% | Kojima et al. (2022); text-davinci-002 |
| 8-shot CoT + self-consistency (k=40) | 74.4% | Wang et al. (2022) |

Prompt Sensitivity Variables

| Variable | Accuracy Effect | Mitigation |
| --- | --- | --- |
| Example order | ±15% variance | Calibration; majority vote over orderings |
| Instruction wording | ±5–20% | Automatic prompt optimization (APE) |
| Number of examples | +2–5% per example up to saturation | Add examples up to the context-window limit |
| Temperature | ±5–10% run-to-run variance at temperature > 0 | Self-consistency reduces variance |
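One of the mitigations in the table, majority voting over example orderings, can be sketched as follows (`predict` stands in for a full few-shot LM call; the biased stub below is an exaggerated order effect for illustration only):

```python
import random
from collections import Counter

def vote_over_orderings(predict, examples, query, n_orders=8, seed=0):
    """Run the same few-shot task under several random example orderings
    and majority-vote the predictions, damping order sensitivity."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_orders):
        order = examples[:]
        rng.shuffle(order)
        votes.append(predict(order, query))
    return Counter(votes).most_common(1)[0][0]

# Stub predictor whose output depends only on which example comes first --
# an exaggerated order effect; voting recovers the majority behavior.
def stub_predict(order, query):
    return order[0][1]

examples = [("good", "positive"), ("bad", "negative"), ("fine", "positive")]
label = vote_over_orderings(stub_predict, examples, "ok?")
```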

Automatic Prompt Engineering (APE) — Zhou et al. (2022)

APE uses the target LM to generate and score candidate prompt instructions:

  1. Generate: “Generate instructions that cause a model to produce the following examples: [demonstrations]”
  2. Score: evaluate each instruction on validation set
  3. Return: highest-scoring instruction
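The three steps amount to a generate-then-rank loop, sketched below (`propose` and `score` stand in for LM calls and validation-set evaluation; the toy stand-ins at the bottom are illustration only):

```python
def ape(propose, score, demonstrations, n_candidates=50):
    """Automatic Prompt Engineer (Zhou et al., 2022), schematically:
    1. propose(demonstrations, n) -> n candidate instruction strings
       (an LM prompted to generate instructions that produce the demos).
    2. score(instruction) -> validation-set quality of that instruction.
    3. Return the highest-scoring candidate.
    """
    candidates = propose(demonstrations, n_candidates)
    return max(candidates, key=score)

# Toy stand-ins: fixed candidate list, score = string length.
demos = [("2+2", "4")]
cands = ["Answer the question.", "Compute the arithmetic result step by step."]
best = ape(lambda d, n: cands, len, demos)
```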

APE outperformed human prompts on 19/24 InstructGPT tasks, with the largest gains on tasks where the optimal phrasing is non-obvious to humans.

Prompt Engineering vs Instruction Tuning

| Property | Prompt Engineering | Instruction Tuning |
| --- | --- | --- |
| Weight updates | None | Yes (fine-tuning) |
| Generalization | Task-specific prompt | Across many instruction types |
| Data required | 0–32 examples | Thousands of labeled instructions |
| Latency/cost | Longer prompts (more context tokens per call) | Shorter prompts suffice at inference |

See instruction-tuning for training-time alternatives to prompting, chain-of-thought for the highest-impact single technique, and temperature-sampling for the inference parameter that interacts with self-consistency.



Frequently Asked Questions

What is the most empirically well-supported prompt engineering technique?

Chain-of-thought prompting (Wei et al., 2022) has the strongest evidence: +20–40% accuracy on multi-step arithmetic and reasoning tasks at large scale. Self-consistency (Wang et al., 2022) stacks another +10–17% via majority voting over multiple sampled CoT paths. Zero-shot CoT, "Let's think step by step" (Kojima et al., 2022), delivers large gains with no example data. These three techniques have been replicated across multiple model families and task types.

What is Automatic Prompt Engineer (APE)?

Zhou et al. (2022) proposed using a language model to generate and evaluate candidate instructions: (1) prompt the model with few-shot examples to generate N candidate instruction strings; (2) evaluate each instruction on a validation set; (3) return the highest-scoring instruction. APE outperformed human-written instructions on 19 of 24 InstructGPT benchmarks, suggesting LLMs can optimize prompts better than humans partly because they have seen the distribution of effective instructions during pre-training.
