Neural networks are only as good as their weights, and finding good weights is the central challenge of machine learning. This chapter explains how networks learn from data using gradient descent and backpropagation - the mathematical machinery that makes modern AI possible.
The loss function: Measuring wrongness
Before we can improve a network, we need to measure how wrong it is. A loss function (or cost function) takes the network's predictions and the true answers, returning a single number quantifying the error. Lower is better.
For regression, Mean Squared Error (MSE) is common: average the squared differences between predictions and targets. Squaring emphasizes large errors and makes the function differentiable everywhere. For classification, Cross-Entropy Loss measures how different the predicted probability distribution is from the true distribution.
The choice of loss function encodes your priorities. MSE penalizes large errors quadratically, making the model highly sensitive to outliers: a single large error can dominate the loss. Mean Absolute Error (MAE) treats all errors linearly, being more robust to outliers but less smooth to optimize (its derivative is discontinuous at zero). Huber loss combines both: quadratic for small errors, linear for large ones.
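To make the trade-off concrete, the sketch below evaluates all three losses on the same predictions, with and without one outlier. The data values and function names are made up for illustration. The outlier enters MSE as 10² = 100, but enters MAE and Huber roughly linearly (10 and 9.5).

```python
import numpy as np

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def mae(pred, target):
    return np.mean(np.abs(pred - target))

def huber(pred, target, delta=1.0):
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2
    linear = delta * err - 0.5 * delta ** 2
    return np.mean(np.where(err <= delta, quadratic, linear))

# Illustrative data: four targets, small errors everywhere except one outlier.
targets = np.array([1.0, 2.0, 3.0, 4.0])
clean = targets + np.array([0.1, -0.1, 0.1, -0.1])
with_outlier = targets + np.array([0.1, -0.1, 0.1, 10.0])

for name, fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name}: clean={fn(clean, targets):.4f}, "
          f"with outlier={fn(with_outlier, targets):.4f}")
```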
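Cross-entropy loss has an elegant interpretation: it measures the average number of bits (with a base-2 logarithm; nats with the natural logarithm) needed to encode data drawn from the true distribution using a code optimized for the predicted distribution. The excess over the entropy of the true distribution, meaning the extra bits paid for using the wrong distribution, is the KL divergence. Because the entropy of the true distribution does not depend on the model, minimizing cross-entropy also minimizes the KL divergence. It is equivalent to maximum likelihood estimation - finding parameters that make the observed data most probable.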
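The coding interpretation can be checked numerically. The sketch below uses two made-up distributions and base-2 logarithms, and confirms that cross-entropy equals the entropy of the true distribution plus the KL divergence (the extra bits). The function names are chosen here for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Entropy H(p) in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy_bits(p, q):
    """Cross-entropy H(p, q) in bits: data from p, code built for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl_bits(p, q):
    """KL divergence D(p || q) in bits: the extra cost of using q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# Illustrative distributions over three classes.
p_true = np.array([0.5, 0.25, 0.25])
q_pred = np.array([0.25, 0.25, 0.5])

h = entropy_bits(p_true)
ce = cross_entropy_bits(p_true, q_pred)
kl = kl_bits(p_true, q_pred)
print(f"H(p)={h:.3f}  H(p,q)={ce:.3f}  KL={kl:.3f}")
```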
For multi-class classification, the softmax function converts raw network outputs (logits) into probabilities: softmax(z_i) = exp(z_i) / Σ_j exp(z_j). Cross-entropy with softmax has a particularly clean gradient: the derivative of the loss with respect to the logits is simply (predicted - actual). Computing the two together, rather than taking the log of a separately computed softmax, is also more numerically stable.
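The (predicted - actual) gradient can be verified with a finite-difference check. This is a minimal sketch for a single example with a one-hot target. The logit values are arbitrary, and `ce_from_logits` is a helper written here that uses the log-sum-exp form.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for stability
    return e / np.sum(e)

def ce_from_logits(z, y):
    """Cross-entropy computed directly from logits (log-sum-exp form)."""
    z_shift = z - np.max(z)
    log_probs = z_shift - np.log(np.sum(np.exp(z_shift)))
    return -np.sum(y * log_probs)

z = np.array([2.0, -1.0, 0.5])   # arbitrary logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target

# Analytic gradient with respect to the logits.
analytic = softmax(z) - y

# Central-difference approximation of the same gradient.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    d = np.zeros_like(z)
    d[i] = eps
    numeric[i] = (ce_from_logits(z + d, y) - ce_from_logits(z - d, y)) / (2 * eps)

print("max abs difference:", np.max(np.abs(analytic - numeric)))
```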
import numpy as np

def mse_loss(predictions, targets):
    """Mean Squared Error for regression."""
    return np.mean((predictions - targets) ** 2)

def cross_entropy_loss(predictions, targets):
    """Cross-entropy for classification.

    predictions: probabilities, shape (batch, classes)
    targets: one-hot labels of the same shape
    """
    # Clip to avoid log(0)
    predictions = np.clip(predictions, 1e-15, 1 - 1e-15)
    # Sum over classes, then average over the batch
    return -np.mean(np.sum(targets * np.log(predictions), axis=-1))

def softmax(logits):
    """Convert logits to probabilities along the last axis."""
    # Subtract max for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

def huber_loss(predictions, targets, delta=1.0):
    """Huber loss - robust to outliers."""
    error = predictions - targets
    is_small_error = np.abs(error) <= delta
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * np.abs(error) - 0.5 * delta ** 2
    return np.mean(np.where(is_small_error, squared_loss, linear_loss))

Gradient descent: Sliding downhill
Imagine the loss function as a landscape where height represents error. We want to find the lowest point - the valley where our network makes the fewest mistakes. Gradient descent is our hiking strategy: look around, find which direction goes downhill fastest, take a step that way, repeat.
The gradient is a vector of partial derivatives telling us how the loss changes with each weight. It points uphill toward increasing loss. So we move opposite to the gradient - downhill toward lower loss. The learning rate controls step size: too large and we overshoot; too small and we inch along forever.
Mathematically, the gradient ∇L is a vector containing ∂L/∂w_i for each weight. The loss increases fastest in the direction of the gradient. The magnitude of the gradient indicates steepness - a large gradient means the loss is changing rapidly; a small gradient means we are near a stationary point (a minimum or a saddle point) or on a plateau.
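The update rule is simply: w_new = w_old - η * ∇L, where η > 0 is the learning rate. This can be derived from a first-order Taylor expansion: for a small step Δw, L(w + Δw) ≈ L(w) + ∇L · Δw. To decrease L, we want ∇L · Δw < 0. Choosing Δw = -η∇L guarantees this whenever ∇L ≠ 0, since ∇L · (-η∇L) = -η||∇L||² < 0. The approximation only holds for small steps, which is why η cannot be made arbitrarily large.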
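As a small worked example of the update rule, the sketch below runs gradient descent on a made-up two-dimensional quadratic whose minimum is at (3, -1). The loss, starting point, and step count are illustrative choices.

```python
import numpy as np

def loss(w):
    # Illustrative quadratic bowl with its minimum at (3, -1).
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def grad(w):
    # Partial derivatives of the loss above.
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])
eta = 0.1
history = [loss(w)]
for _ in range(50):
    w = w - eta * grad(w)   # w_new = w_old - eta * gradient
    history.append(loss(w))

print("final weights:", w)
print("first and last loss:", history[0], history[-1])
```

With this step size every update lowers the loss, as the first-order argument predicts.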
Learning rate selection is crucial. If η is too large, updates overshoot minima and training diverges - loss oscillates wildly or explodes to infinity. If η is too small, training converges painfully slowly, potentially getting stuck in suboptimal regions. Modern practice often starts with a larger rate and decreases it over time.
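The effect of η is easy to see on the one-dimensional loss L(w) = w², where the gradient is 2w and each update multiplies w by (1 - 2η). The three rates below are illustrative. One is too large (|1 - 2η| > 1, so it diverges), one is well chosen, and one is very small.

```python
def run_gradient_descent(eta, steps=20, w0=1.0):
    """Minimize L(w) = w**2 starting from w0; returns the final w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w   # gradient of w**2 is 2w
    return w

for eta in [1.1, 0.45, 0.001]:
    print(f"eta={eta}: w after 20 steps = {run_gradient_descent(eta):.6g}")
```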
def gradient_descent(weights, gradients, learning_rate):
    """One step of gradient descent."""
    return weights - learning_rate * gradients

# Learning rate schedules
def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
    """Reduce learning rate by factor every N epochs."""
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def cosine_annealing(initial_lr, epoch, total_epochs):
    """Smoothly decrease learning rate following cosine curve."""
    return initial_lr * (1 + np.cos(np.pi * epoch / total_epochs)) / 2

def warmup_schedule(initial_lr, epoch, warmup_epochs=5):
    """Gradually increase learning rate at start of training."""
    if epoch < warmup_epochs:
        return initial_lr * (epoch + 1) / warmup_epochs
    return initial_lr

Backpropagation: Computing gradients efficiently
A neural network might have millions of weights. Computing gradients naively - perturbing each weight and measuring loss change - would require millions of forward passes per update. Backpropagation computes all gradients in just one forward pass plus one backward pass.
The key insight is the chain rule from calculus. If loss depends on layer 3, which depends on layer 2, which depends on layer 1, we can compute how loss changes with layer 1 by multiplying the derivatives along the chain. Working backward from output to input - hence backpropagation - we efficiently compute every gradient.
The chain rule states that if y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx). For neural networks, the loss depends on the output, which depends on the last hidden layer, which depends on the previous layer, and so on. The gradient for any weight is the product of all local derivatives along the path from that weight to the loss.
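A tiny numeric instance of the chain rule, using made-up functions g(x) = 3x + 1 and f(u) = u². The analytic product of local derivatives is compared against a finite-difference estimate.

```python
def g(x):
    return 3.0 * x + 1.0

def f(u):
    return u ** 2

x = 2.0
u = g(x)            # inner value: 7.0

dy_du = 2.0 * u     # local derivative of f at u
du_dx = 3.0         # local derivative of g at x
dy_dx = dy_du * du_dx   # chain rule: multiply along the path

# Finite-difference check of the composed function.
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(dy_dx, numeric)
```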
Backpropagation exploits the fact that many paths share subpaths. Instead of recomputing shared derivatives, we compute them once and reuse them. Starting from the output, we compute ∂L/∂a for each layer's activations, then use these to compute ∂L/∂w for the weights. This ordering ensures each derivative is computed exactly once.
For a layer with pre-activation z = W a_{prev} + b and output a = f(z), define the layer error δ = ∂L/∂a ⊙ f'(z), where ⊙ denotes element-wise multiplication. The gradients are then ∂L/∂W = δ a_{prev}^T, ∂L/∂b = δ, and ∂L/∂a_{prev} = W^T δ. The term ∂L/∂a is passed backward from the next layer; f'(z) is the local derivative of the activation function; and the matrix products combine these with the layer's weights and inputs.
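The single-layer formulas can be checked on one example vector. The sketch below assumes a tanh activation and a squared-error loss, both chosen only for illustration, and uses random values for W, b, and the input. It forms the layer error as ∂L/∂a times f'(z) element-wise, then compares one entry of ∂L/∂W with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
a_prev = rng.normal(size=4)
target = rng.normal(size=3)

def forward(W, b, a_prev):
    z = W @ a_prev + b
    return z, np.tanh(z)

def loss_fn(W, b, a_prev):
    _, a = forward(W, b, a_prev)
    return 0.5 * np.sum((a - target) ** 2)

# Backward pass for this one layer.
z, a = forward(W, b, a_prev)
dL_da = a - target                          # from the (illustrative) loss
delta = dL_da * (1.0 - np.tanh(z) ** 2)     # element-wise product with f'(z)
dL_dW = np.outer(delta, a_prev)             # delta times a_prev transposed
dL_db = delta
dL_da_prev = W.T @ delta                    # passed to the previous layer

# Finite-difference check of a single weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss_fn(W_plus, b, a_prev) - loss_fn(W_minus, b, a_prev)) / (2 * eps)
print(dL_dW[1, 2], numeric)
```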
def backpropagation(X, y, weights, biases, activations):
    """Compute gradients via backpropagation.

    Assumes activations[0] is the input X, activations[i + 1] is the output
    of layer i, hidden layers use ReLU, and the output layer uses softmax
    with cross-entropy loss (y is one-hot).
    """
    m = X.shape[0]  # batch size
    grads_W = []
    grads_b = []
    # Output layer error (assuming cross-entropy + softmax)
    dA = activations[-1] - y
    # Propagate backward through layers
    for i in reversed(range(len(weights))):
        # Gradient for weights and biases
        dW = np.dot(activations[i].T, dA) / m
        db = np.sum(dA, axis=0) / m
        grads_W.insert(0, dW)
        grads_b.insert(0, db)
        # Propagate to previous layer
        if i > 0:
            dA = np.dot(dA, weights[i].T)
            dA *= (activations[i] > 0)  # ReLU derivative
    return grads_W, grads_b

# Numerical gradient check (for debugging)
def numerical_gradient(f, x, eps=1e-5):
    """Compute gradient numerically for verification."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy(); x_plus.flat[i] += eps
        x_minus = x.copy(); x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

Stochastic gradient descent
Computing the exact gradient requires processing the entire dataset - impractical for millions of examples. Stochastic Gradient Descent (SGD) approximates the gradient using a small random batch, typically 32-256 examples. This is noisier but much faster and often escapes local minima better.
The key insight behind SGD is that the gradient averaged over a uniformly sampled batch is an unbiased estimate of the full-dataset gradient: it equals the true gradient in expectation. While any single batch gradient might point in the wrong direction, over many iterations the errors average out. This stochastic approximation enables training on datasets too large to fit in memory.
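A quick way to see this is with a linear-regression MSE gradient on synthetic data (the data, sizes, and seed below are made up). When the dataset is split into equal-sized random batches, the mean of the batch gradients reproduces the full gradient, even though each individual batch gradient deviates from it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)
w = np.zeros(5)   # current weights

def mse_grad(w, Xb, yb):
    """Gradient of mean squared error for a linear model on one batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_grad(w, X, y)

# Partition a random permutation into 20 batches of 50 examples.
perm = rng.permutation(1000)
batch_grads = np.array([mse_grad(w, X[idx], y[idx])
                        for idx in perm.reshape(-1, 50)])
mean_of_batches = batch_grads.mean(axis=0)

print("single batch error:", np.linalg.norm(batch_grads[0] - full_grad))
print("mean-of-batches error:", np.linalg.norm(mean_of_batches - full_grad))
```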
Batch size affects both training dynamics and final performance. Smaller batches provide more gradient noise, which can help escape sharp local minima that generalize poorly. Larger batches give more accurate gradients but may converge to sharper minima. Many practitioners find batch sizes of 32-128 work well across diverse tasks.
One epoch is one complete pass through the training data. Shuffling the data between epochs ensures the model sees examples in different orders, preventing it from learning spurious patterns based on data ordering. Training typically runs for tens to hundreds of epochs until validation performance plateaus.
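One possible way to implement per-epoch shuffling is a small generator that draws a fresh permutation each time it is called. The name `iterate_minibatches` and the toy data are hypothetical, chosen for this sketch.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering the data once, in random order."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

# Toy data: 10 examples, so batch_size=4 gives batches of 4, 4, and 2.
rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

for epoch in range(2):
    order = []
    for X_batch, y_batch in iterate_minibatches(X, y, batch_size=4, rng=rng):
        order.extend(y_batch.tolist())
    print(f"epoch {epoch}: order seen = {order}")
```

Each call to the generator is one epoch: every example appears exactly once, and the last batch may be smaller than the rest.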
Advanced optimizers: Momentum and Adam
Basic SGD can oscillate in narrow valleys and slow down near flat regions. Momentum adds velocity - the update accumulates over iterations, building speed in consistent directions and dampening oscillations. It is like a ball rolling downhill with inertia.
Adam (Adaptive Moment Estimation) combines momentum with per-parameter learning rates. Parameters with large gradients get smaller steps; parameters with small gradients get larger steps. Adam adapts to the loss landscape, making it robust across many problems. It has become the default optimizer for most deep learning.
Momentum maintains a velocity vector v that accumulates past gradients: v = βv + g, then w = w - ηv. The hyperparameter β (typically 0.9) controls how quickly past gradients decay. High momentum means the optimizer remembers further back, helping traverse flat regions and dampening oscillations in curved ones.
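The momentum update can be sketched as below and compared with plain gradient descent on a made-up quadratic that is flat in one direction and steep in the other. The curvature values, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

CURVATURE = np.array([1.0, 100.0])  # illustrative: one flat, one steep direction

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def grad(w):
    return CURVATURE * w

def momentum_step(w, v, g, lr=0.01, beta=0.9):
    """v = beta * v + g, then w = w - lr * v."""
    v = beta * v + g
    w = w - lr * v
    return w, v

w_plain = np.array([1.0, 1.0])
w_mom = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(100):
    w_plain = w_plain - 0.01 * grad(w_plain)
    w_mom, v = momentum_step(w_mom, v, grad(w_mom))

print(f"plain GD loss after 100 steps: {loss(w_plain):.6f}")
print(f"momentum loss after 100 steps: {loss(w_mom):.6f}")
```

In this setup plain gradient descent crawls along the flat direction, while the accumulated velocity lets momentum cover the same ground in far fewer steps.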
RMSprop adapts learning rates per-parameter using a running average of squared gradients: s = γs + (1-γ)g², then w = w - η * g / √(s + ε). Parameters with large recent gradients get smaller learning rates, preventing overshooting. The ε term (typically 1e-8) prevents division by zero.
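A minimal sketch of the RMSprop update as written above, with ε inside the square root to match the formula in the text (some libraries add it outside instead). The two gradient values are made up to differ by four orders of magnitude. After normalization both parameters move by nearly the same amount.

```python
import numpy as np

def rmsprop_step(w, s, g, lr=0.01, gamma=0.9, eps=1e-8):
    """s = gamma * s + (1 - gamma) * g**2, then w = w - lr * g / sqrt(s + eps)."""
    s = gamma * s + (1.0 - gamma) * g ** 2
    w = w - lr * g / np.sqrt(s + eps)
    return w, s

w = np.zeros(2)
s = np.zeros(2)
g = np.array([100.0, 0.01])   # illustrative gradients of very different scale

w_new, s = rmsprop_step(w, s, g)
step_sizes = np.abs(w_new - w)
print("step sizes:", step_sizes)
```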
AdamW modifies Adam by decoupling weight decay from the gradient update. Instead of adding L2 regularization to the loss, AdamW directly shrinks weights: w = w - η*(m̂/√(v̂+ε) + λw), where m̂ and v̂ are Adam's bias-corrected running averages of the gradient and the squared gradient, and λ is the weight decay coefficient. This subtle change often improves generalization and has become standard for training transformers.
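The AdamW step can be sketched as follows. The hyperparameter defaults are common choices but are assumptions here. ε is placed inside the square root to mirror the formula in the text, whereas the original Adam formulation adds ε outside the square root. Setting `weight_decay=0` gives a plain Adam step.

```python
import numpy as np

def adamw_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g ** 2     # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay: applied to w directly, not through the gradient.
    w = w - lr * (m_hat / np.sqrt(v_hat + eps) + weight_decay * w)
    return w, m, v

# First step from w = 0 with gradient 5: the step is close to lr in size.
w1, m1, v1 = adamw_step(np.array([0.0]), np.zeros(1), np.zeros(1),
                        np.array([5.0]), t=1, weight_decay=0.0)

# Zero gradient: only the weight decay term moves the weight.
w2, _, _ = adamw_step(np.array([1.0]), np.zeros(1), np.zeros(1),
                      np.array([0.0]), t=1)
print(w1, w2)
```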
The training loop
Putting it all together: training loops through the dataset repeatedly (epochs), processing batches, computing gradients, and updating weights. We monitor loss on held-out validation data to detect overfitting - when training loss keeps dropping but validation loss rises.
Early stopping is a simple but effective regularization technique. Monitor validation loss and stop training when it has not improved for a patience period (e.g., 10 epochs). This prevents overfitting by halting before the model memorizes training data. The best model checkpoint is saved during training.
Gradient clipping prevents exploding gradients by capping their magnitude. If the gradient norm exceeds a threshold (e.g., 1.0), scale all gradients down proportionally. This is essential for training RNNs and useful for stabilizing any deep network. It prevents a single bad batch from destroying learned weights.
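Clipping by global norm can be sketched in a few lines of NumPy. The helper name `clip_by_global_norm` and the example gradients are illustrative. The key property is that every gradient is scaled by the same factor, so the direction of the update is preserved.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Small constant guards against division by zero when all gradients are zero.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Two parameter tensors whose combined norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```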
import torch
import torch.nn.functional as F

def train(model, X_train, y_train, X_val, y_val, epochs=100, lr=0.001,
          batch_size=32, patience=10, checkpoint_path="best_model.pt"):
    """Train with Adam, gradient clipping, and early stopping.

    Assumes X_* are float tensors, y_* are tensors of class indices,
    and model(X) returns raw logits.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val_loss = float('inf')
    patience_counter = 0
    for epoch in range(epochs):
        model.train()
        # Shuffle training data
        indices = torch.randperm(len(X_train))
        X_train, y_train = X_train[indices], y_train[indices]
        # Process the shuffled data in batches
        for start in range(0, len(X_train), batch_size):
            X_batch = X_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]
            optimizer.zero_grad()
            # Forward pass
            predictions = model(X_batch)
            loss = F.cross_entropy(predictions, y_batch)
            # Backward pass
            loss.backward()
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
        # Evaluate on validation set
        model.eval()
        with torch.no_grad():
            val_loss = F.cross_entropy(model(X_val), y_val).item()
        # train_loss here is the loss of the last batch in the epoch
        print(f"Epoch {epoch}: train_loss={loss.item():.4f}, val_loss={val_loss:.4f}")
        # Early stopping check
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            torch.save(model.state_dict(), checkpoint_path)  # Save best model
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered")
                break

Common training problems and solutions
Vanishing gradients occur when gradients become too small to cause meaningful weight updates. This typically affects early layers in deep networks, especially with sigmoid or tanh activations. Solutions include using ReLU activations, careful initialization, batch normalization, and residual connections.
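A back-of-the-envelope sketch of why sigmoid networks suffer here: the sigmoid derivative is at most 0.25 (reached at z = 0), so the activation derivatives alone shrink the gradient by at least a factor of 4 per layer. The calculation below ignores the weight matrices, which is a simplifying assumption, and contrasts this with ReLU, whose derivative is 1 for positive inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Best case for sigmoid: every pre-activation sits at z = 0.
for depth in [1, 5, 10, 20]:
    sig_factor = sigmoid_grad(0.0) ** depth
    relu_factor = relu_grad(1.0) ** depth
    print(f"depth {depth:2d}: sigmoid factor = {sig_factor:.3e}, "
          f"ReLU factor = {relu_factor:.1f}")
```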
Exploding gradients cause weights to become extremely large, leading to numerical overflow. This is common in RNNs and very deep networks. Gradient clipping is the standard solution. If gradients consistently explode, reduce the learning rate or check for bugs in the network architecture.
Loss plateaus occur when the loss stops decreasing but hasn't reached a good solution. This might indicate a learning rate that's too small, a local minimum, or a saddle point. Solutions include learning rate warm restarts, using optimizers with momentum, or increasing batch size to get more accurate gradient estimates.
Overfitting manifests as training loss decreasing while validation loss increases. The model is memorizing training data rather than learning generalizable patterns. Solutions include more training data, data augmentation, dropout, weight decay, early stopping, and simpler architectures.