Neural Network Fundamentals: Universal Approximation, Depth vs Width, and Activation Functions

Category: training · Updated: 2026-02-27

Universal approximation theorem: a network with a single hidden layer can approximate any continuous function on compact subsets of ℝ^d to arbitrary accuracy, given sufficiently many hidden units (Cybenko, 1989); depth enables exponentially more compact representations than equivalent shallow networks.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Universal approximation theorem | ∃ 2-layer net that ε-approximates any f ∈ C([0,1]^d) | — | Cybenko (1989); requires potentially exponential width; depth reduces this |
| Depth efficiency | Exponential | functions representable | A depth-L network can represent functions requiring exponential width in a depth-1 network |
| ReLU activation | f(x) = max(0, x) | — | Simple; mitigates vanishing gradients; yields piecewise-linear functions |
| GeLU derivative at x = 0 | 0.5 | gradient | GeLU(x) = x·Φ(x), where Φ is the Gaussian CDF; a smooth approximation to ReLU |
| He initialization for ReLU | W ~ N(0, 2/n_in) | — | He et al. (2015): variance scaled by 2/n_in to preserve activation variance under ReLU |

Neural networks are parameterized compositions of linear transformations and nonlinear activation functions. Understanding their fundamental mathematical properties — expressivity, trainability, and initialization — is essential context for understanding why modern architectures like transformers are designed the way they are.

The Universal Approximation Theorem

Theorem (Cybenko, 1989): Let φ: ℝ → ℝ be any continuous sigmoidal function. For any continuous function f: [0,1]^d → ℝ and ε > 0, there exists a finite linear combination F(x) = Σᵢ αᵢ φ(wᵢᵀx + bᵢ) such that:

|F(x) − f(x)| < ε for all x ∈ [0,1]^d

This establishes that 2-layer networks are universal. The catch: the number of hidden units required may be exponential in d.
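The constructive flavor of the theorem can be illustrated directly: a steep sigmoid behaves as a step function, so a hand-built staircase of the form F(x) = Σᵢ αᵢ σ(wᵢx + bᵢ) tracks a continuous target without any training (a NumPy sketch; the target function, unit count, and steepness are illustrative choices, not from the source):

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -60.0, 60.0)  # avoid overflow in exp for very steep units
    return 1.0 / (1.0 + np.exp(-z))

# Target: an arbitrary continuous function on [0, 1].
f = lambda x: np.sin(2 * np.pi * x)

# Build F(x) = sum_i alpha_i * sigmoid(w * (x - c_i)) by hand:
# each steep sigmoid is ~a step at c_i, and the step heights telescope
# so the staircase equals f(c_k) between consecutive centers. No
# training is involved -- the theorem only asserts existence.
n_units = 200
centers = np.linspace(0.0, 1.0, n_units)
w = 2000.0                                   # steepness of each "step"
alphas = np.diff(f(centers), prepend=0.0)    # step heights

def F(x):
    # x: shape (m,); broadcast against the n_units hidden units
    return sigmoid(w * (x[:, None] - centers[None, :])) @ alphas

x = np.linspace(0.01, 0.99, 500)
err = np.max(np.abs(F(x) - f(x)))
print(f"max |F - f| on [0.01, 0.99] = {err:.4f}")  # shrinks as n_units grows
```

Doubling `n_units` roughly halves the error, illustrating the width/accuracy trade-off the theorem leaves open.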

Depth vs Width Efficiency

| Representation | Shallow Network Width Required | Deep Network Width Required |
|---|---|---|
| Parity function on n bits | 2^{n−1} units | O(n) units (layered) |
| Polynomial of degree d | Exponential in d | O(d) units per layer |
| XOR on n variables | 2^{n−1} units | O(n) units, O(log n) layers |

Depth enables exponentially compact representation of hierarchically structured functions.
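The parity row can be made concrete. XOR of two bits is exactly representable with two ReLU units, since relu(a+b) − 2·relu(a+b−1) evaluates to 0, 1, 0 as a+b runs over 0, 1, 2; composing these gates in a balanced tree computes parity on n bits with O(n) units and O(log n) layers (an illustrative sketch, not a trained network):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def xor_gate(a, b):
    # Exact XOR of {0,1} values with two ReLU units:
    # relu(a+b) - 2*relu(a+b-1) = 0, 1, 0 for a+b = 0, 1, 2.
    s = a + b
    return relu(s) - 2.0 * relu(s - 1.0)

def parity_deep(bits):
    # Layered XOR tree: O(n) ReLU units in total across O(log n)
    # layers, versus the 2^(n-1) units a depth-1 network would need.
    layer = np.asarray(bits, dtype=float)
    while layer.size > 1:
        if layer.size % 2:                 # pad odd-length layers with 0
            layer = np.append(layer, 0.0)
        layer = xor_gate(layer[0::2], layer[1::2])
    return int(layer[0])

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(parity_deep(bits), sum(bits) % 2)    # both 0: tree matches parity
```

The tree structure is exactly the "composition of simpler functions" that makes depth efficient: each layer reuses the same two-unit gate instead of enumerating all odd-weight input patterns.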

Activation Functions

| Activation | Formula | Derivative | Properties |
|---|---|---|---|
| Sigmoid | σ(x) = 1/(1+e^{−x}) | σ(1−σ) | Saturates; vanishing gradients |
| Tanh | tanh(x) = (e^x−e^{−x})/(e^x+e^{−x}) | 1−tanh² | Zero-centered; still saturates |
| ReLU | max(0, x) | 𝟙[x>0] | No saturation for x>0; "dying ReLU" for x<0 |
| Leaky ReLU | max(αx, x), α = 0.01 | 𝟙[x>0] + α·𝟙[x≤0] | Small gradient for x<0 |
| GeLU | x·Φ(x) | Φ(x) + x·φ(x) | Smooth; used in modern transformers |
| SiLU/Swish | x·σ(x) | σ(x)(1 + x(1−σ(x))) | Smooth; non-monotonic |
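The derivative column above can be verified numerically with central finite differences (a sketch; the test point x = 0.7 is arbitrary):

```python
import numpy as np
from math import erf, sqrt, pi

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(0.0, x)
def gelu(x):    # exact form x * Phi(x), Phi = standard normal CDF
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))
def silu(x):    return x * sigmoid(x)

def num_grad(f, x, h=1e-5):
    # central difference: error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
s = sigmoid(x)
assert np.isclose(num_grad(sigmoid, x), s * (1 - s))
assert np.isclose(num_grad(np.tanh, x), 1 - np.tanh(x) ** 2)
assert np.isclose(num_grad(relu, x), 1.0)            # indicator[x > 0]
phi = np.exp(-x ** 2 / 2) / sqrt(2 * pi)             # Gaussian pdf
Phi = 0.5 * (1 + erf(x / sqrt(2)))
assert np.isclose(num_grad(gelu, x), Phi + x * phi)
assert np.isclose(num_grad(silu, x), s * (1 + x * (1 - s)))
print("derivative formulas match; GeLU'(0) =", num_grad(gelu, 0.0))  # ≈ 0.5
```

Note that GeLU(0) itself is 0; it is the slope at the origin, Φ(0) = 0.5, that distinguishes GeLU from ReLU's kink.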

Weight Initialization

Proper initialization prevents gradient explosion and vanishing at the start of training:

| Initialization | Formula | Designed For |
|---|---|---|
| Xavier/Glorot | W ~ U(−√(6/(n_in+n_out)), √(6/(n_in+n_out))) | Sigmoid, tanh |
| He / Kaiming | W ~ N(0, 2/n_in) | ReLU |
| LeCun | W ~ N(0, 1/n_in) | SELU |
| Small constant | W ~ N(0, 0.02²) | Transformers (embedding, output) |

Transformers use a modified initialization: embedding weights drawn from N(0, 0.02²), other weights scaled as N(0, d_model^{-0.5}).
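The variance-preservation rationale for He initialization can be checked empirically: with Var(W) = 2/n_in the scale of activations survives many ReLU layers, while an uncorrected 1/n_in variance halves the signal's second moment at every layer (a NumPy sketch; width, depth, and batch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, batch = 512, 20, 4096

x = rng.standard_normal((batch, n))
h_he, h_naive = x.copy(), x.copy()

for _ in range(depth):
    # He: Var(W) = 2/n_in compensates for ReLU zeroing ~half the inputs
    W_he = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
    # Naive: Var(W) = 1/n_in, no ReLU correction
    W_naive = rng.standard_normal((n, n)) * np.sqrt(1.0 / n)
    h_he = np.maximum(0.0, h_he @ W_he)
    h_naive = np.maximum(0.0, h_naive @ W_naive)

print(f"He init:    activation std after {depth} layers = {h_he.std():.3f}")
print(f"naive init: activation std after {depth} layers = {h_naive.std():.6f}")
```

ReLU applied to a zero-mean pre-activation discards half the signal's second moment; the factor 2 in He initialization exactly compensates, so forward (and backward) signal scale is depth-independent at initialization.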

See feed-forward-layers for how activation functions appear in transformer FFN layers, residual-connections for how depth is made trainable via skip connections, and backpropagation for gradient computation through these activation functions.


Frequently Asked Questions

What does the universal approximation theorem actually guarantee?

The theorem (Cybenko 1989, Hornik 1991) proves that a feedforward neural network with a single hidden layer and a non-polynomial activation function can approximate any continuous function on a compact domain to arbitrary precision, given sufficient hidden units. The key limitation: it does not guarantee learning — only existence. The required width can be exponential in input dimension, and gradient descent may not find the optimal solution.

Why does depth matter if shallow networks are universal approximators?

While shallow (2-layer) networks are universal approximators in theory, they may require exponentially many hidden units to represent certain functions. Deep networks can represent the same functions with exponentially fewer total neurons by composing simpler functions. In practice, depth enables hierarchical feature learning — early layers detect simple patterns; later layers combine them into complex abstractions. This is why all successful large language models use dozens to hundreds of layers.

Why did ReLU replace sigmoid and tanh as the standard activation?

Sigmoid and tanh saturate at large positive and negative values, producing near-zero gradients (vanishing gradient problem). ReLU = max(0,x) has gradient exactly 1 for positive inputs and 0 for negative inputs. The constant gradient for positive inputs prevents vanishing gradients in deep networks. ReLU also computes faster than sigmoid/tanh and produces sparse activations (many neurons are exactly zero), which has a useful regularization effect.
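The gradient argument above is just a product of local derivatives through the layers, which a two-line computation makes vivid (a sketch; the depth and pre-activation value are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Backpropagated gradient magnitude through L layers is (roughly) a
# product of per-layer activation derivatives. Sigmoid' is at most
# 0.25, so the product shrinks geometrically with depth; ReLU' is
# exactly 1 along the active path.
L = 30
x = 0.5                                        # a typical pre-activation
sig_grad = (sigmoid(x) * (1 - sigmoid(x))) ** L
relu_grad = 1.0 ** L                           # active units: derivative 1

print(f"sigmoid path gradient after {L} layers: {sig_grad:.3e}")  # vanishes
print(f"ReLU path gradient after {L} layers:    {relu_grad:.1f}")
```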
