Backpropagation (short for Backward Propagation of Errors) is the algorithm that teaches a neural network how to learn.
It answers this question:
“How should we change each weight so the model makes better predictions?”
Big Picture Idea
Training a neural network has 3 main steps:
1. Forward Pass → Make a prediction
2. Compute Loss → Measure the error
3. Backward Pass → Adjust the weights (Backpropagation)
Backpropagation happens in step 3.
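The loop formed by these three steps can be sketched in plain Python. This is a toy sketch (not from the original text), assuming a one-parameter model ŷ = w·x with squared-error loss:

```python
# Toy training loop: one parameter, one training example (illustrative sketch).
x, y_true = 2.0, 6.0   # single example; a perfect fit needs w = 3
w = 0.0                # initial weight
lr = 0.05              # learning rate

for step in range(100):
    y_pred = w * x                     # 1. forward pass  -> prediction
    loss = (y_pred - y_true) ** 2      # 2. compute loss  -> error
    grad = 2 * (y_pred - y_true) * x   # 3. backward pass -> dLoss/dw
    w -= lr * grad                     #    update the weight

print(round(w, 3))  # w approaches 3.0
```

Each pass nudges w toward the value that makes the prediction match the target.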
Step-by-Step Intuition
Imagine a team project that went wrong: to improve next time, you need to know how much each member contributed to the mistake.

Backpropagation does the same for a network. It traces the error backward through every layer and assigns each weight its share of the blame, which tells us exactly how to adjust that weight.
The Structure We’ll Use
Simple 1-hidden-layer network:
Input → Hidden → Output
Mathematically:
z1 = W1x + b1
a1 = f(z1)
z2 = W2a1 + b2
ŷ = f(z2)
Where:

- x = input vector
- W1, W2 = weight matrices; b1, b2 = bias vectors
- f = activation function (e.g., sigmoid)
- a1 = hidden-layer activation
- ŷ = the network's prediction
Step 1: Loss Function
We measure error using a loss function.
Common example (Mean Squared Error):
Loss = (ŷ - y)^2
Where:

- ŷ = the model's prediction
- y = the true target value
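A quick sanity check with two hypothetical predictions (values chosen purely for illustration):

```python
# Squared-error loss for a single prediction: the bigger the miss,
# the bigger the penalty.
def squared_error(y_pred, y_true):
    return (y_pred - y_true) ** 2

print(squared_error(0.8, 1.0))  # small miss -> loss ≈ 0.04
print(squared_error(0.2, 1.0))  # big miss   -> loss ≈ 0.64
```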
Step 2: The Core Idea of Backpropagation
We compute:
How much does each weight affect the loss?
To do this, we use:
The Chain Rule (From Calculus)
If:
- Loss depends on ŷ
- ŷ depends on z2
- z2 depends on W2
Then:
dLoss/dW2 = dLoss/dŷ × dŷ/dz2 × dz2/dW2
This is the chain rule.
Backpropagation = applying the chain rule layer by layer, backwards.
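The product of local derivatives really does equal the true derivative, and we can verify that numerically. Here is a small check using a single sigmoid neuron as a stand-in for the network (the names and values are illustrative):

```python
import math

# Chain rule check for Loss = (sigmoid(w * a) - y)^2.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

a, y, w = 1.5, 1.0, 0.4

def loss(w_):
    return (sigmoid(w_ * a) - y) ** 2

# Chain rule: dLoss/dw = dLoss/dŷ * dŷ/dz * dz/dw
z = w * a
y_hat = sigmoid(z)
grad_chain = 2 * (y_hat - y) * (y_hat * (1 - y_hat)) * a

# Numerical derivative (central difference) for comparison
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)  # the two values agree closely
```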
Why “Back” Propagation?
Because we compute gradients:
Output layer → Hidden layer → Input layer
We go backward through the network.
⚙️ Step 3: Gradient Descent Update
After computing the gradient:
W = W - learning_rate * gradient
This moves the weights in the direction that reduces the loss.
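A single update step on a toy quadratic loss shows the rule in action (the loss function here is just an illustration, not from the original text):

```python
# One gradient-descent step on Loss = (w - 5)^2: the update W = W - lr * gradient
# moves w toward the minimum at w = 5, so the loss drops.
w, lr = 0.0, 0.1
loss_before = (w - 5) ** 2
grad = 2 * (w - 5)        # dLoss/dw
w = w - lr * grad         # the update rule from the text
loss_after = (w - 5) ** 2
print(loss_before, loss_after)  # 25.0 16.0
```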
Complete Training Flow
| Step | What Happens |
|------|--------------|
| 1 | Forward pass |
| 2 | Compute loss |
| 3 | Compute gradients (backpropagation) |
| 4 | Update weights |
| 5 | Repeat |
Intuitive Example (Single Neuron)
Neuron:
ŷ = sigmoid(wx + b)
Loss = (ŷ - y)^2
Backprop finds:
∂Loss/∂w
∂Loss/∂b
Then updates:
w = w - lr * ∂Loss/∂w
b = b - lr * ∂Loss/∂b
Why Activation Functions Matter Here
Backprop requires derivatives.
Example:
Sigmoid derivative:
sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
ReLU derivative:
ReLU'(x) = 1 if x > 0 else 0
If the derivative is very small, gradients shrink layer by layer as they flow backward → the vanishing gradient problem.

This is why ReLU became popular in deep models, including those built at Google and OpenAI.
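Evaluating the two derivatives makes the difference concrete: sigmoid's gradient collapses for large inputs, while ReLU's stays at 1 (a small illustrative script):

```python
import math

# Compare sigmoid' and ReLU' across a range of inputs.
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_derivative(x), relu_derivative(x))
# sigmoid'(10) is about 4.5e-05 (almost no gradient); ReLU' stays 1.0
```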
Mathematical Flow
For output layer:
δ_output = (ŷ - y) * f'(z2)
For hidden layer:
δ_hidden = (W2^T * δ_output) * f'(z1)
Weight gradients:
dW2 = δ_output * a1^T
dW1 = δ_hidden * x^T
Update:
W = W - lr * dW
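These equations can be checked end to end with NumPy. Note that δ_output = (ŷ - y)·f'(z2) corresponds to the loss ½(ŷ - y)², i.e. the constant factor of 2 is folded away, which is a common convention. The sketch below follows that convention, assumes sigmoid for f at both layers, and verifies dW1 against a finite difference (all shapes and values are illustrative):

```python
import numpy as np

def f(z):                        # sigmoid activation
    return 1 / (1 + np.exp(-z))

def f_prime(z):                  # its derivative
    s = f(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1))                          # input (2x1)
y = np.array([[1.0]])                                # target (1x1)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((2, 1))
W2, b2 = rng.normal(size=(1, 2)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = f(z1)
z2 = W2 @ a1 + b2
y_hat = f(z2)

# Backward pass, following the formulas above
delta_output = (y_hat - y) * f_prime(z2)             # output-layer delta
delta_hidden = (W2.T @ delta_output) * f_prime(z1)   # hidden-layer delta
dW2 = delta_output @ a1.T
dW1 = delta_hidden @ x.T

# Finite-difference check on one entry of W1 (loss = 0.5 * (ŷ - y)^2)
def half_mse(W1_):
    return 0.5 * float(((f(W2 @ f(W1_ @ x + b1) + b2) - y) ** 2).sum())

eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (half_mse(W1p) - half_mse(W1m)) / (2 * eps)
print(numeric, dW1[0, 0])  # analytic and numeric gradients agree
```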
Simple Backprop Example
This is a tiny 1-neuron example:
```python
import math

# Sigmoid activation
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Its derivative
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Training example
x = 2
y_true = 1

# Initialize weights
w = 0.5
b = 0.0
lr = 0.1

# Forward pass
z = w * x + b
y_pred = sigmoid(z)

# Loss derivative: dLoss/dŷ for Loss = (ŷ - y)^2
dL_dy = 2 * (y_pred - y_true)

# Backprop (chain rule)
dL_dz = dL_dy * sigmoid_derivative(z)
dL_dw = dL_dz * x
dL_db = dL_dz

# Update
w -= lr * dL_dw
b -= lr * dL_db

print("Updated weight:", w)
print("Updated bias:", b)
```
Backprop in PyTorch (Automatic Differentiation)
Modern libraries compute gradients automatically.
```python
import torch
import torch.nn as nn

# Simple model: one linear layer
model = nn.Linear(1, 1)

# Data
x = torch.tensor([[2.0]])
y = torch.tensor([[1.0]])

# Loss
criterion = nn.MSELoss()

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Forward
y_pred = model(x)
loss = criterion(y_pred, y)

# Backward (backpropagation happens here)
loss.backward()

# Update
optimizer.step()

print("Updated weight:", list(model.parameters()))
```
Notice the call to `loss.backward()`: that single line performs the entire backward pass automatically, using automatic differentiation.
Common Problems in Backpropagation
| Problem | Cause | Solution | |----------|--------|-----------| | Vanishing gradient | Small derivatives (Sigmoid/Tanh) | Use ReLU | | Exploding gradient | Very large gradients | Gradient clipping | | Slow convergence | Poor learning rate | Use Adam optimizer |
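Gradient clipping from the table can be sketched in a few lines of plain Python (the function name is illustrative; PyTorch provides the equivalent as `torch.nn.utils.clip_grad_norm_`):

```python
import math

# Clip a gradient vector by its global L2 norm: if the norm exceeds max_norm,
# rescale every component so the norm equals max_norm (direction is preserved).
def clip_by_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

print(clip_by_norm([300.0, 400.0], max_norm=1.0))  # exploding gradient tamed
print(clip_by_norm([0.3, 0.4], max_norm=1.0))      # small gradient untouched
```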
Why Backpropagation Is Powerful
Without backprop, the only way to find good weights would be trial and error: perturbing each weight individually and re-running the network to see how the loss changes. For models with millions of parameters, that is computationally hopeless; backpropagation computes every gradient in a single backward pass.
Backpropagation is the engine behind modern AI systems.
Compilation of All Code Blocks (Combined)
```python
import math
import torch
import torch.nn as nn

# -----------------------------
# Manual Backprop (1 Neuron)
# -----------------------------
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

x = 2
y_true = 1
w = 0.5
b = 0.0
lr = 0.1

# Forward
z = w * x + b
y_pred = sigmoid(z)

# Loss derivative
dL_dy = 2 * (y_pred - y_true)

# Backprop
dL_dz = dL_dy * sigmoid_derivative(z)
dL_dw = dL_dz * x
dL_db = dL_dz

# Update
w -= lr * dL_dw
b -= lr * dL_db

print("Manual Backprop Updated weight:", w)
print("Manual Backprop Updated bias:", b)

# -----------------------------
# PyTorch Automatic Backprop
# -----------------------------
model = nn.Linear(1, 1)
x_torch = torch.tensor([[2.0]])
y_torch = torch.tensor([[1.0]])
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

y_pred = model(x_torch)
loss = criterion(y_pred, y_torch)
loss.backward()
optimizer.step()

print("PyTorch Updated Parameters:", list(model.parameters()))
```