🧠 AI Fundamentals
50 pages · each with citation snippet, JSON-LD, data tables, and real research sources
🧠 Architecture
architecture
Attention Heads: Specialization, Pruning, and What Different Heads Learn
Trained transformer attention heads specialize: positional heads track adjacent tokens, syntactic heads model grammatical dependencies, semantic heads capture coreference; Voita et al. (2019) pruned 38 of 48 encoder heads with only ~0.15 BLEU loss.
architecture
Encoder-Decoder Architecture: Cross-Attention, Autoregressive Decoding, and Seq2Seq Performance
The transformer encoder maps n input tokens to continuous representations z; the decoder autoregressively generates m output tokens via cross-attention over z; the base model achieves 27.3 BLEU and the big model 28.4 BLEU on WMT 2014 EN-DE (Vaswani et al., 2017).
architecture
Layer Normalization: Formula, Pre-Norm vs Post-Norm, and Training Stability
Layer normalization normalizes across d_model features per token: y = γ·(x−μ)/σ + β; applied before each sublayer in pre-norm transformers; enables stable training of 100+ layer networks (Ba et al., 2016).
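A minimal NumPy sketch of the formula above (token count and values are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's d_model features to zero mean, unit variance,
    # then apply the learned scale (gamma) and shift (beta).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

d_model = 512
x = np.random.randn(4, d_model) * 3 + 7          # 4 tokens, shifted and scaled
y = layer_norm(x, np.ones(d_model), np.zeros(d_model))
# Each row of y now has mean ~0 and std ~1, whatever x's scale was.
```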
architecture
Multi-Head Attention: Projection Matrices, Parameter Count, and Head Ablations
Multi-head attention uses h=8 heads with d_k=64 each; the base transformer's attention block contains ~1.05M parameters; ablations show 8 heads achieves 25.8 BLEU vs 24.9 for a single head (Vaswani et al., 2017).
architecture
Position-Wise Feed-Forward Layers: FFN Formula, Parameter Budget, and GeLU vs ReLU
Each transformer FFN layer computes max(0,xW₁+b₁)W₂+b₂ with d_ff=2048 (4× d_model=512); each FFN sublayer holds 8·d_model² ≈ 2.1M parameters — about two-thirds of a layer's attention+FFN weights; GeLU outperforms ReLU on NLP benchmarks (Hendrycks & Gimpel, 2016).
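The formula and the per-layer parameter count can be sketched directly (the weight initialization is arbitrary, for illustration only):

```python
import numpy as np

d_model, d_ff = 512, 2048                        # d_ff = 4 × d_model, as in the base model
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # max(0, xW1 + b1)W2 + b2, applied identically at every position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

params = W1.size + b1.size + W2.size + b2.size   # ~2.1M per FFN sublayer
y = ffn(rng.standard_normal((10, d_model)))      # 10 positions in, 10 out
```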
architecture
Residual Connections: Skip Connections, Gradient Flow, and Deep Network Training
Residual connections compute output = x + Sublayer(x), providing a gradient highway that bypasses each sublayer; He et al. (2016) showed they enable training of 1,000-layer networks with no vanishing gradient.
architecture
Scaled Dot-Product Attention: Formula, Complexity, and the √d_k Scaling Factor
Scaled dot-product attention computes softmax(Q·Kᵀ/√d_k)·V, scaling by √d_k=8 to prevent vanishing gradients; time complexity is O(n²·d), quadratic in sequence length n (Vaswani et al., 2017).
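A minimal single-head NumPy implementation of the formula (sequence length and inputs are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) score matrix: O(n²·d) work
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

n, d_k = 6, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w is a probability distribution over the n keys.
```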
architecture
Sinusoidal Positional Encoding: Wavelengths, Extrapolation, and Learned vs Fixed Comparison
Sinusoidal positional encodings define PE(pos,2i)=sin(pos/10000^{2i/d_model}), with wavelengths from 2π to 10000·2π; Vaswani et al. (2017) found learned and fixed encodings achieve equivalent BLEU on WMT EN-DE.
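The encoding is easy to generate directly from the definition:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # dimension-pair index
    angles = pos / (10000 ** (2 * i / d_model))    # wavelengths from 2π to 10000·2π
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(512, 512)
# Position 0 is sin(0)=0 on even dims and cos(0)=1 on odd dims.
```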
architecture
Softmax Function: Formula, Temperature Scaling, and Numerical Stability
Softmax σ(z_i) = e^{z_i}/Σe^{z_j} converts attention logits to probability distributions; temperature T<1 sharpens toward argmax (greedy), T→∞ flattens to uniform; numerically stabilized by subtracting max(z).
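A numerically stable softmax with temperature, showing both limiting behaviors:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                 # subtracting max(z) prevents exp overflow
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
sharp = softmax(logits, T=0.1)      # low T: mass concentrates on the argmax
flat = softmax(logits, T=100.0)     # high T: distribution approaches uniform
```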
architecture
Transformer Architecture: Encoder-Decoder Design and Dimensions
The original transformer base uses 6 encoder and 6 decoder layers, d_model=512, 8 attention heads, and 65M parameters, reaching 27.3 BLEU on WMT 2014 English-German; the 213M-parameter big variant reaches 28.4 BLEU (Vaswani et al., 2017).
🧠 Representation
representation
Attention Is All You Need: The Transformer Paper — Key Results and Impact
Vaswani et al. (NeurIPS 2017) introduced the transformer architecture, achieving 28.4 BLEU on WMT EN-DE — surpassing all prior models including ensembles — with the 213M-parameter big model trained for 3.5 days on 8 P100 GPUs; the 65M-parameter base model trained in just 12 hours.
representation
Byte-Pair Encoding: Algorithm, Vocabulary Construction, and Byte-Level BPE
BPE iteratively merges the most frequent adjacent byte pair until vocabulary reaches target size; GPT-2 applies BPE on raw UTF-8 bytes, producing 50,257 tokens with guaranteed full Unicode coverage and zero unknown tokens (Radford et al., 2019).
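A toy version of the merge loop on character-level input (the corpus and merge count are illustrative; production BPE operates on bytes with pre-tokenization):

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Repeatedly find the most frequent adjacent symbol pair and merge it.
    words = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(w[i]); i += 1
            merged.append(out)
        words = merged
    return merges, words

merges, segmented = bpe_train(["low", "lower", "lowest", "low"], num_merges=2)
# After two merges, "low" becomes a single symbol in every word.
```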
representation
Context Window: Sequence Length Limits, Positional Methods, and Long-Context Extensions
Original transformer context window: 512 tokens; self-attention memory scales as O(n²), making long contexts expensive; RoPE enables extrapolation beyond training length; some architectures reach 1M-token context windows.
representation
Knowledge Distillation: Soft Targets, Temperature Scaling, and Compression Ratios
Knowledge distillation trains a student on teacher soft labels at temperature T; DistilBERT achieves 97% of BERT's GLUE score at 60% the size (66M vs 110M parameters) using T=4 and a distillation loss coefficient of 0.9 (Sanh et al., 2019).
representation
KV Cache: Key-Value Caching for Efficient Autoregressive Inference
KV caching stores key-value pairs from previous tokens, reducing per-step projection FLOPs from O(n·d²) to O(d²); cache size for a 32-layer, 32-head, d_head=128 model at 4K tokens is ~2.1 GB at fp16 (K and V, 2 bytes per value).
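A quick check of the cache-size arithmetic for those dimensions (fp16 = 2 bytes; both keys and values are stored at every layer):

```python
layers, heads, d_head, seq_len = 32, 32, 128, 4096
bytes_per_value = 2                          # fp16
kv = 2                                       # one K and one V tensor per layer
cache_bytes = kv * layers * heads * d_head * seq_len * bytes_per_value
cache_gib = cache_bytes / 2**30              # ≈ 2.0 GiB for a single sequence
```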
representation
Mixture of Experts: Sparse Gating, Switch Transformer, and Efficient Scaling
Sparse MoE routes each token to the top-k of N expert FFN layers; Switch Transformer (Fedus et al., 2022) uses k=1 routing to scale to 1.6T parameters while activating only a small fraction per token — up to 7× pre-training speedup over a FLOP-matched dense T5 baseline.
representation
Tokenization: Subword Units, Vocabulary Size, and Characters Per Token
Subword tokenization with BPE produces vocabularies of 32K–100K units; GPT-2's 50,257-token vocabulary averages ~4 characters per English token; a 1,000-word passage encodes to approximately 1,300 tokens.
representation
Word Embeddings: Distributed Representations, word2vec, and Semantic Geometry
word2vec skip-gram learns 300-dim embeddings where cosine similarity encodes semantics; vector arithmetic king − man + woman ≈ queen holds with ~76% accuracy on semantic analogies; GloVe achieves 75.0% on word analogy tasks (Mikolov et al., 2013; Pennington et al., 2014).
🧠 Inference
inference
Autoregressive Decoding: Greedy, Beam Search, and Sampling Strategies
Beam search (Sutskever et al., 2014) maintains k candidate sequences by cumulative log-probability; top-p nucleus sampling (Holtzman et al., 2020) dynamically selects the minimum token set covering probability mass p.
inference
Beam Search: Approximate Sequence Optimization in Autoregressive Models
Beam search maintains k hypotheses; at each step expands k×|V| continuations, retains top-k by Σ log P(y_t|y_{<t}); Sutskever et al. (NeurIPS 2014) used k=2–12 for seq2seq MT; Holtzman et al. (2020) showed beam-decoded text has lower human preference than nucleus-sampled text for open generation.
inference
Inference vs Training Compute: FLOPs per Token vs Total Training Cost
Training FLOPs ≈ 6·N·D for dense transformers (N parameters, D tokens); inference ≈ 2·N FLOPs per token; a 70B model requires ~1.4×10¹¹ FLOPs per token vs ~5.9×10²³ total training FLOPs (Chinchilla-optimal); training = ~4.2 trillion inference tokens equivalent.
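The training-vs-inference arithmetic from these rules of thumb:

```python
N = 70e9                                   # parameters
D = 1.4e12                                 # training tokens (~20 per parameter)
train_flops = 6 * N * D                    # ≈ 5.9e23 total
infer_flops_per_token = 2 * N              # ≈ 1.4e11 per generated token
tokens_equivalent = train_flops / infer_flops_per_token   # = 3·D ≈ 4.2e12
```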
inference
Quantization: Reducing Numerical Precision in Neural Network Weights and Activations
LLM.int8() (Dettmers et al., NeurIPS 2022): mixed-precision INT8 with FP16 for 0.1% outlier features enables inference at 8-bit with no accuracy degradation; GPTQ (Frantar et al., ICLR 2023): Hessian-compensated INT4 achieves <1% perplexity increase on 175B-scale models.
inference
Temperature Sampling: Controlling Randomness in Autoregressive Language Model Generation
Temperature T scales logits before softmax: p_i = exp(z_i/T) / Σ exp(z_j/T); T→0 approaches greedy decoding; T=1 is standard softmax; T>1 increases entropy. Holtzman et al. (ICLR 2020) showed T-based sampling produces incoherent text at high values without truncation.
inference
Top-p (Nucleus) Sampling: Adaptive Vocabulary Truncation for Language Model Decoding
Nucleus (top-p) sampling: select smallest V' ⊆ V such that Σ_{w∈V'} p(w|context) ≥ p, renormalize, sample; Holtzman et al. (ICLR 2020) showed top-p=0.9 produces text more preferred by humans than top-k, temperature-only, or greedy decoding across all evaluated metrics.
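A minimal implementation of the truncation step (the probability vector is illustrative):

```python
import numpy as np

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative mass reaches p,
    # zero out the rest, and renormalize.
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1    # first prefix with mass >= p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.6, 0.25, 0.1, 0.05])
nucleus = top_p_filter(probs, p=0.9)             # keeps the top 3 tokens only
```

Sampling then proceeds from the renormalized `nucleus` distribution instead of the full vocabulary.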
🧠 Training
training
Backpropagation: Chain Rule, Computational Graphs, and Automatic Differentiation
Backpropagation computes ∂L/∂W via the chain rule: ∂L/∂W = ∂L/∂output · ∂output/∂W; a single backward pass computes all N parameter gradients in O(N) operations — same asymptotic cost as the forward pass (Rumelhart et al., 1986).
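The chain rule can be checked numerically on a tiny linear model with squared-error loss (all values are random illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))                 # weights
x = rng.standard_normal(3)                      # input
t = rng.standard_normal(2)                      # target

def loss(W):
    y = x @ W                                   # forward pass
    return 0.5 * float(((y - t) ** 2).sum())

# Backward pass via the chain rule:
# dL/dW = (dL/dy)(dy/dW) = outer(x, y − t)
grad_analytic = np.outer(x, x @ W - t)

# Central finite-difference check of one entry of W
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
grad_numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
# grad_analytic[0, 0] agrees with grad_numeric to ~1e-6
```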
training
Chinchilla Scaling: Compute-Optimal Training and the 20-Token-Per-Parameter Rule
Chinchilla scaling (Hoffmann et al., 2022): compute-optimal training uses ~20 tokens per parameter; Chinchilla-70B (1.4T tokens) outperforms Gopher-280B (300B tokens) using 4× less compute, showing prior large models were severely undertrained.
training
Compute FLOPs: Counting Training and Inference Operations for Language Models
Training FLOPs ≈ 6·N·D for dense transformers (N parameters, D tokens); inference costs ≈ 2·N FLOPs per token; an A100 GPU delivers 312 TFLOPS (BF16), making GPT-3 training require ~10⁴ A100-days.
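A back-of-envelope check of the A100-days figure, assuming GPT-3's published ~3.14×10²³ training FLOPs and perfect utilization (real utilization is far lower):

```python
gpt3_train_flops = 3.14e23
a100_bf16_flops_per_sec = 312e12           # peak throughput, BF16
seconds_per_day = 86_400
a100_days = gpt3_train_flops / (a100_bf16_flops_per_sec * seconds_per_day)
# ≈ 1.2e4 A100-days at 100% utilization
```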
training
Gradient Descent and Adam Optimizer: Update Rules and Hyperparameters
SGD updates θ ← θ − η∇L; Adam adapts per-parameter learning rates using m_t = β₁m_{t-1}+(1−β₁)g_t and v_t = β₂v_{t-1}+(1−β₂)g_t²; typical transformer settings: β₁=0.9, β₂=0.95–0.999, ε=1e-8 (Kingma & Ba, 2015).
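The two moment updates plus bias correction, applied to a toy 1-D quadratic (hyperparameters as listed above):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                 # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2              # second-moment estimate
    m_hat = m / (1 - b1**t)                   # bias correction for zero init
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta², starting from theta = 1
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2 * theta                             # gradient of theta²
    theta, m, v = adam_step(theta, g, m, v, t)
# theta is driven close to the minimum at 0
```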
training
Masked Language Modeling: BERT's Pre-Training Objective and Bidirectional Context
BERT's MLM masks 15% of tokens — 80% replaced with [MASK], 10% random token, 10% unchanged — enabling bidirectional context encoding; BERT-large achieved a GLUE score of 80.5, a 7.7-point absolute improvement over the prior state of the art (Devlin et al., 2019).
training
Neural Network Fundamentals: Universal Approximation, Depth vs Width, and Activation Functions
Universal approximation theorem: a network with a single hidden layer of sufficiently many units can approximate any continuous function on compact subsets of ℝ^d to arbitrary accuracy (Cybenko, 1989); depth enables exponentially more compact representations than equivalent shallow networks.
training
Next-Token Prediction: Causal Language Modeling Objective and Perplexity
Causal language modeling maximizes log P(x) = Σₜ log P(x_t | x_{<t}); perplexity = exp(−(1/N)Σ log P(x_t|context)); GPT-2 1.5B achieved perplexity 35.8 on Penn Treebank without fine-tuning (Radford et al., 2019).
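Perplexity from per-token probabilities, using made-up values:

```python
import math

token_probs = [0.2, 0.5, 0.1, 0.4]   # hypothetical P(x_t | x_<t) for 4 tokens
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)                  # = geometric mean of inverse probabilities
```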
training
Pre-Training: Self-Supervised Learning on Large Text Corpora
Pre-training on large corpora with self-supervised objectives (causal LM or MLM) produces general representations; GPT-3 was pre-trained on 300B tokens at 175B parameters using ~3.14×10²³ FLOPs (Brown et al., 2020).
training
Scaling Laws: How Language Model Performance Scales with Parameters, Data, and Compute
Kaplan et al. (2020) found L ∝ N^{-0.076} and L ∝ D^{-0.095}; Chinchilla (2022) revised: optimal N and D both scale as C^{0.5}, so a 70B model should train on ~1.4T tokens to be compute-optimal.
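Deriving the compute-optimal split from C ≈ 6·N·D and D ≈ 20·N:

```python
def chinchilla_optimal(C, tokens_per_param=20):
    # C = 6·N·D with D = 20·N  =>  N = sqrt(C / 120), D = 20·N
    N = (C / (6 * tokens_per_param)) ** 0.5
    return N, tokens_per_param * N

N, D = chinchilla_optimal(5.88e23)   # roughly Chinchilla's compute budget
# N ≈ 7.0e10 parameters, D ≈ 1.4e12 tokens
```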
training
Training Data Curation: Web Filtering, Deduplication, and Quality Selection
Common Crawl contains 400B+ tokens of raw web text; quality filtering (perplexity scoring, deduplication, URL filtering) retains ~5–20% as training data; Penedo et al. (2024) showed with FineWeb that filtering for quality improves benchmark scores by 2–4 points.
🧠 Agents & Applications
agents-applications
Chain-of-Thought Prompting: Intermediate Reasoning Steps Improve Multi-Step Accuracy
Wei et al. (NeurIPS 2022): adding step-by-step reasoning to 8-shot examples raised PaLM 540B GSM8K accuracy 18% → 57%; Kojima et al. (2022): zero-shot CoT 'Let's think step by step' raised MultiArith 17.7% → 78.7%; self-consistency (Wang et al., 2022) adds +17% via majority vote.
agents-applications
Few-Shot Learning: Language Model Task Performance from k In-Context Demonstrations
Brown et al. (NeurIPS 2020): GPT-3 175B 32-shot SuperGLUE = 79.3 vs fine-tuned BERT 88.9; Zhao et al. (ICML 2021): different orderings of same k examples produce up to ±15% accuracy variance; calibrating against neutral-input priors reduces order sensitivity.
agents-applications
In-Context Learning: Task Adaptation from Prompt Examples Without Weight Updates
Brown et al. (NeurIPS 2020) GPT-3: k-shot ICL from prompt examples without weight updates; 32-shot achieves 79.3 on SuperGLUE vs 88.9 fine-tuned BERT; Min et al. (EMNLP 2022): randomly flipping demonstration labels drops accuracy only ~10%, indicating format/distribution matters more than correct labels.
agents-applications
Prompt Engineering: Systematic Input Design for Language Model Accuracy and Reliability
Zero-shot CoT (Kojima et al., 2022): MultiArith 17.7% → 78.7%; self-consistency with 40 samples (Wang et al., 2022): GSM8K +17%; APE (Zhou et al., ICLR 2023): LLM-generated instructions match or outperform human-written prompts on 19 of 24 instruction-induction tasks.
agents-applications
Retrieval-Augmented Generation: Grounding Language Models in External Knowledge
RAG (Lewis et al., NeurIPS 2020): retriever encodes query q and passages d_i as dense vectors; top-k retrieved by maximum inner product search; RAG-Sequence achieved 44.5% Exact Match on NaturalQuestions vs 29.8% for closed-book T5; DPR retriever top-20 accuracy 78.4% vs BM25's 59.1%.
agents-applications
Tool Use and Function Calling: Language Models Invoking External Functions
Toolformer (Schick et al., NeurIPS 2023): model self-supervises API call insertion; reduces perplexity on 5 tools vs baseline; ReAct (Yao et al., ICLR 2023): interleaved reasoning+actions raise HotpotQA EM from 29.0% to 35.1% and ALFWorld success from 25% to 71%.
🧠 Alignment
alignment
Constitutional AI: Self-Critique, Revision, and Principle-Based Alignment
Constitutional AI (Bai et al., 2022) uses a written constitution to guide self-critique and revision; RLAIF replaces human labeling of harmfulness with AI feedback, achieving harmlessness comparable to RLHF without human harm annotations.
alignment
Fine-Tuning Language Models: Full, Adapter, and Parameter-Efficient Methods
Howard & Ruder (2018) ULMFiT established pretraining + fine-tuning as the dominant NLP paradigm; PEFT methods (LoRA, adapters) achieve within 1% of full fine-tuning quality while updating <1% of parameters.
alignment
Instruction Tuning: Zero-Shot Generalization via Multi-Task Fine-Tuning
Wei et al. (2022) FLAN: instruction tuning on 62 NLP datasets grouped into task clusters improves zero-shot generalization to held-out clusters; 137B FLAN outperforms zero-shot GPT-3 175B on 20 of 25 evaluated tasks.
alignment
LoRA: Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
LoRA (Hu et al., 2021): rank-4 decomposition ΔW=BA reduces trainable parameters to 0.01% of full model while matching full fine-tuning BLEU on E2E NLG; no inference latency added after weight merging.
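A sketch of the low-rank update at toy dimensions (d=512, r=4 here; the 0.01% figure applies at GPT-3 scale, where d is much larger):

```python
import numpy as np

d, r = 512, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init so ΔW = BA = 0 at start

def lora_forward(x):
    # W x + B A x, without ever materializing the full ΔW matrix
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal(d)
y = lora_forward(x)
trainable = A.size + B.size              # 2·d·r = 4096 vs d² = 262144 (~1.6% here)

# After training, merge once: W' = W + B A, so inference pays no extra cost
W_merged = W + B @ A
```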
alignment
Reinforcement Learning Basics: MDPs, Policy Gradients, and PPO
PPO (Schulman et al., 2017): clipped surrogate objective prevents destructive policy updates; achieves better sample efficiency than TRPO with simpler implementation; PPO is the standard RL optimizer in RLHF pipelines.
alignment
RLHF: Reinforcement Learning from Human Feedback — Reward Model and PPO
RLHF trains a reward model on human pairwise preferences, then optimizes via PPO with KL penalty: R = r_θ(x,y) − β·KL(π_RL || π_SFT); introduced for language models by Stiennon et al. (NeurIPS 2020), extended by InstructGPT (Ouyang et al., 2022).
alignment
The Alignment Problem: Specifying and Optimizing for Human Values
Goodhart's law (1975): 'When a measure becomes a target, it ceases to be a good measure.' In AI alignment, reward proxies optimized by RL often diverge from intended behavior; RLHF partially addresses this via learned reward models.
🧠 Evaluation
evaluation
Emergent Capabilities: Abilities That Appear Above Scale Thresholds in Language Models
Wei et al. (TMLR 2022): 137 tasks show near-zero then sharp improvement above scale thresholds in 8 model families; 3-digit arithmetic emerges ~8–13B parameters; Schaeffer et al. (NeurIPS 2023): switching to continuous metrics largely eliminates apparent discontinuities, suggesting measurement artifact.
evaluation
Hallucination Mechanisms: Why Language Models Generate Plausible but Incorrect Text
Ji et al. (ACM Computing Surveys 2023): hallucination = content unsupported or contradicted by source; Mallen et al. (ACL 2023): entities in bottom-25% training frequency show 4–14× higher hallucination rates than top-25%; Maynez et al. (ACL 2020): ~30% of abstractive summaries contain hallucinated content.
evaluation
Perplexity: Information-Theoretic Measure of Language Model Prediction Quality
PPL(W) = exp(−(1/N) Σ log P(w_i|w_{<i})) = exp(cross-entropy per token); GPT-2 1.5B zero-shot: 35.8 PPL on Penn Treebank (Radford et al., 2019); GPT-3 175B zero-shot: 20.5 PPL (Brown et al., 2020); 4-gram KN baseline: 141.2 PPL; human-level estimated ~10–20 PPL.
50 fact pages covering transformer architecture, representation, training dynamics, alignment, inference, agents, and evaluation.