Backpropagation (short for Backward Propagation of Errors) is the algorithm that teaches a neural network how to learn.
It answers this question:
“How should we change each weight so the model makes better predictions?”
Big Picture Idea
Training a neural network has 3 main steps:
1. Forward Pass → Make a prediction
2. Compute Loss → Measure the error
3. Backward Pass → Adjust the weights (Backpropagation)
Backpropagation happens in step 3.
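The loop formed by these three steps can be sketched in plain Python. This is a toy sketch (not from the original text), assuming a one-parameter model ŷ = w·x with squared-error loss:

```python
# Toy training loop: one parameter, one training example (illustrative sketch).
x, y_true = 2.0, 6.0   # single example; a perfect fit needs w = 3
w = 0.0                # initial weight
lr = 0.05              # learning rate

for step in range(100):
    y_pred = w * x                     # 1. forward pass  -> prediction
    loss = (y_pred - y_true) ** 2      # 2. compute loss  -> error
    grad = 2 * (y_pred - y_true) * x   # 3. backward pass -> dLoss/dw
    w -= lr * grad                     #    update the weight

print(round(w, 3))  # w approaches 3.0
```

Each pass nudges w toward the value that makes the prediction match the target.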
Step-by-Step Intuition
Imagine a team project that went wrong: to improve next time, you need to know how much each member contributed to the mistake.

Backpropagation does the same for a network. It traces the error backward through every layer and assigns each weight its share of the blame, which tells us exactly how to adjust that weight.
The Structure We’ll Use
Simple 1-hidden-layer network:
Input → Hidden → Output
Mathematically:
z1 = W1x + b1
a1 = f(z1)
z2 = W2a1 + b2
ŷ = f(z2)
Where:

- x = input vector
- W1, W2 = weight matrices; b1, b2 = bias vectors
- f = activation function (e.g., sigmoid)
- a1 = hidden-layer activation
- ŷ = the network's prediction
Step 1: Loss Function
We measure error using a loss function.
Common example (Mean Squared Error):
Loss = (ŷ - y)^2
Where:

- ŷ = the model's prediction
- y = the true target value
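A quick sanity check with two hypothetical predictions (values chosen purely for illustration):

```python
# Squared-error loss for a single prediction: the bigger the miss,
# the bigger the penalty.
def squared_error(y_pred, y_true):
    return (y_pred - y_true) ** 2

print(squared_error(0.8, 1.0))  # small miss -> loss ≈ 0.04
print(squared_error(0.2, 1.0))  # big miss   -> loss ≈ 0.64
```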
Step 2: The Core Idea of Backpropagation
We compute:
How much does each weight affect the loss?
To do this, we use:
The Chain Rule (From Calculus)
If:
- Loss depends on ŷ
- ŷ depends on z2
- z2 depends on W2
Then:
dLoss/dW2 = dLoss/dŷ × dŷ/dz2 × dz2/dW2
This is the chain rule.
Backpropagation = applying the chain rule layer by layer, backwards.
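The product of local derivatives really does equal the true derivative, and we can verify that numerically. Here is a small check using a single sigmoid neuron as a stand-in for the network (the names and values are illustrative):

```python
import math

# Chain rule check for Loss = (sigmoid(w * a) - y)^2.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

a, y, w = 1.5, 1.0, 0.4

def loss(w_):
    return (sigmoid(w_ * a) - y) ** 2

# Chain rule: dLoss/dw = dLoss/dŷ * dŷ/dz * dz/dw
z = w * a
y_hat = sigmoid(z)
grad_chain = 2 * (y_hat - y) * (y_hat * (1 - y_hat)) * a

# Numerical derivative (central difference) for comparison
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)  # the two values agree closely
```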
Why “Back” Propagation?
Because we compute gradients:
Output layer → Hidden layer → Input layer
We go backward through the network.
⚙️ Step 3: Gradient Descent Update
After computing the gradient:
W = W - learning_rate * gradient
This moves the weights in the direction that reduces the loss.
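A single update step on a toy quadratic loss shows the rule in action (the loss function here is just an illustration, not from the original text):

```python
# One gradient-descent step on Loss = (w - 5)^2: the update W = W - lr * gradient
# moves w toward the minimum at w = 5, so the loss drops.
w, lr = 0.0, 0.1
loss_before = (w - 5) ** 2
grad = 2 * (w - 5)        # dLoss/dw
w = w - lr * grad         # the update rule from the text
loss_after = (w - 5) ** 2
print(loss_before, loss_after)  # 25.0 16.0
```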
Complete Training Flow
| Step | What Happens |
|------|--------------|
| 1 | Forward pass |
| 2 | Compute loss |
| 3 | Compute gradients (backpropagation) |
| 4 | Update weights |
| 5 | Repeat |
Intuitive Example (Single Neuron)
Neuron:
ŷ = sigmoid(wx + b)
Loss = (ŷ - y)^2
Backprop finds:
∂Loss/∂w
∂Loss/∂b
Then updates:
w = w - lr * ∂Loss/∂w
b = b - lr * ∂Loss/∂b
Why Activation Functions Matter Here
Backprop requires derivatives.
Example:
Sigmoid derivative:
sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
ReLU derivative:
ReLU'(x) = 1 if x > 0 else 0
If the derivative is very small, gradients shrink layer by layer as they flow backward → the vanishing gradient problem.

This is why ReLU became popular in deep models, including those built at Google and OpenAI.
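Evaluating the two derivatives makes the difference concrete: sigmoid's gradient collapses for large inputs, while ReLU's stays at 1 (a small illustrative script):

```python
import math

# Compare sigmoid' and ReLU' across a range of inputs.
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_derivative(x), relu_derivative(x))
# sigmoid'(10) is about 4.5e-05 (almost no gradient); ReLU' stays 1.0
```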
Mathematical Flow
For output layer:
δ_output = (ŷ - y) * f'(z2)
For hidden layer:
δ_hidden = (W2^T * δ_output) * f'(z1)
Weight gradients:
dW2 = δ_output * a1^T
dW1 = δ_hidden * x^T
Update:
W = W - lr * dW
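These equations can be checked end to end with NumPy. Note that δ_output = (ŷ - y)·f'(z2) corresponds to the loss ½(ŷ - y)², i.e. the constant factor of 2 is folded away, which is a common convention. The sketch below follows that convention, assumes sigmoid for f at both layers, and verifies dW1 against a finite difference (all shapes and values are illustrative):

```python
import numpy as np

def f(z):                        # sigmoid activation
    return 1 / (1 + np.exp(-z))

def f_prime(z):                  # its derivative
    s = f(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1))                          # input (2x1)
y = np.array([[1.0]])                                # target (1x1)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((2, 1))
W2, b2 = rng.normal(size=(1, 2)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = f(z1)
z2 = W2 @ a1 + b2
y_hat = f(z2)

# Backward pass, following the formulas above
delta_output = (y_hat - y) * f_prime(z2)             # output-layer delta
delta_hidden = (W2.T @ delta_output) * f_prime(z1)   # hidden-layer delta
dW2 = delta_output @ a1.T
dW1 = delta_hidden @ x.T

# Finite-difference check on one entry of W1 (loss = 0.5 * (ŷ - y)^2)
def half_mse(W1_):
    return 0.5 * float(((f(W2 @ f(W1_ @ x + b1) + b2) - y) ** 2).sum())

eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (half_mse(W1p) - half_mse(W1m)) / (2 * eps)
print(numeric, dW1[0, 0])  # analytic and numeric gradients agree
```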
Simple Backprop Example
This is a tiny 1-neuron example:
```python
import math

# Sigmoid activation
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Its derivative
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Training example
x = 2
y_true = 1

# Initialize weights
w = 0.5
b = 0.0
lr = 0.1

# Forward pass
z = w * x + b
y_pred = sigmoid(z)

# Loss derivative: dLoss/dŷ for Loss = (ŷ - y)^2
dL_dy = 2 * (y_pred - y_true)

# Backprop (chain rule)
dL_dz = dL_dy * sigmoid_derivative(z)
dL_dw = dL_dz * x
dL_db = dL_dz

# Update
w -= lr * dL_dw
b -= lr * dL_db

print("Updated weight:", w)
print("Updated bias:", b)
```
Backprop in PyTorch (Automatic Differentiation)
Modern libraries compute gradients automatically.
```python
import torch
import torch.nn as nn

# Simple model: one linear layer
model = nn.Linear(1, 1)

# Data
x = torch.tensor([[2.0]])
y = torch.tensor([[1.0]])

# Loss
criterion = nn.MSELoss()

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Forward
y_pred = model(x)
loss = criterion(y_pred, y)

# Backward (backpropagation happens here)
loss.backward()

# Update
optimizer.step()

print("Updated weight:", list(model.parameters()))
```
Notice the call to `loss.backward()`: that single line performs the entire backward pass automatically, using automatic differentiation.
Common Problems in Backpropagation
| Problem | Cause | Solution | |----------|--------|-----------| | Vanishing gradient | Small derivatives (Sigmoid/Tanh) | Use ReLU | | Exploding gradient | Very large gradients | Gradient clipping | | Slow convergence | Poor learning rate | Use Adam optimizer |
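Gradient clipping from the table can be sketched in a few lines of plain Python (the function name is illustrative; PyTorch provides the equivalent as `torch.nn.utils.clip_grad_norm_`):

```python
import math

# Clip a gradient vector by its global L2 norm: if the norm exceeds max_norm,
# rescale every component so the norm equals max_norm (direction is preserved).
def clip_by_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

print(clip_by_norm([300.0, 400.0], max_norm=1.0))  # exploding gradient tamed
print(clip_by_norm([0.3, 0.4], max_norm=1.0))      # small gradient untouched
```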
Why Backpropagation Is Powerful
Without backprop, the only way to find good weights would be trial and error: perturbing each weight individually and re-running the network to see how the loss changes. For models with millions of parameters, that is computationally hopeless; backpropagation computes every gradient in a single backward pass.
Backpropagation is the engine behind modern AI systems.
Compilation of All Code Blocks (Combined)
```python
import math
import torch
import torch.nn as nn

# -----------------------------
# Manual Backprop (1 Neuron)
# -----------------------------
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

x = 2
y_true = 1
w = 0.5
b = 0.0
lr = 0.1

# Forward
z = w * x + b
y_pred = sigmoid(z)

# Loss derivative
dL_dy = 2 * (y_pred - y_true)

# Backprop
dL_dz = dL_dy * sigmoid_derivative(z)
dL_dw = dL_dz * x
dL_db = dL_dz

# Update
w -= lr * dL_dw
b -= lr * dL_db

print("Manual Backprop Updated weight:", w)
print("Manual Backprop Updated bias:", b)

# -----------------------------
# PyTorch Automatic Backprop
# -----------------------------
model = nn.Linear(1, 1)
x_torch = torch.tensor([[2.0]])
y_torch = torch.tensor([[1.0]])
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

y_pred = model(x_torch)
loss = criterion(y_pred, y_torch)
loss.backward()
optimizer.step()

print("PyTorch Updated Parameters:", list(model.parameters()))
```