Activation functions are mathematical functions used inside neural networks to decide:
“Should this neuron activate (pass information forward) or not?”
Without activation functions, a neural network would behave like a simple linear model — no matter how many layers it has.
Why Do We Need Activation Functions?
Recall the neuron formula:
y = f(Wx + b)
If we remove f( ) (the activation function), the network becomes:
y = Wx + b
Even with many layers, this remains just a linear transformation.
- Real-world data is non-linear
- Activation functions introduce non-linearity between layers

This allows deep networks to:

- Learn complex, non-linear patterns (images, language, audio)
- Build hierarchical features layer by layer instead of collapsing into one linear map
- Approximate a very broad class of functions
How Activation Functions Work

Step-by-step inside a neuron:

1. Compute the weighted sum: z = Wx + b
2. Apply the activation function: a = f(z)
3. Pass the result a forward to the next layer
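The steps above can be sketched in plain Python (the weights, bias, and input here are made-up illustrative values):

```python
# Made-up weights, bias, and input for illustration
W = [0.5, -0.2, 0.1]
x = [1.0, 2.0, 3.0]
b = 0.4

# Step 1: weighted sum z = Wx + b
z = sum(w_i * x_i for w_i, x_i in zip(W, x)) + b

# Step 2: apply an activation function (ReLU here)
a = max(0.0, z)

# Step 3: 'a' is what gets passed to the next layer
print(z, a)
```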
Common Activation Functions
The table below summarizes the most common activation functions:
| Activation | Formula | Range | Used In | Pros | Cons |
|------------|---------|-------|---------|------|------|
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Binary classification (output) | Smooth, probabilistic | Vanishing gradient |
| Tanh | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | Hidden layers (older networks) | Zero-centered | Vanishing gradient |
| ReLU | max(0, x) | [0, ∞) | Hidden layers (modern default) | Fast, simple | Dying ReLU |
| Leaky ReLU | max(0.01x, x) | (-∞, ∞) | Hidden layers | Fixes dying ReLU | Slightly more complex |
| ELU | x if x > 0 else α(e^x − 1) | (-α, ∞) | Deep networks | Smooth, better learning | Slower |
| Softmax | e^xi / Σ e^xj | (0, 1), sums to 1 | Multi-class output | Probabilities | Output layer only |
1️⃣ Sigmoid
Formula:
sigmoid(x) = 1 / (1 + e^(-x))
Output range:
0 to 1
Used for:

- Binary classification output layers (the output can be read as a probability)
- Gating mechanisms (e.g. inside LSTMs)

Problem:

- Saturates for large |x|, causing vanishing gradients
- Outputs are not zero-centered, which can slow training
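A quick numeric check of the formula, showing the midpoint and the saturation at both ends:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5: midpoint of the curve
print(sigmoid(10))   # ~1.0: saturates for large positive x
print(sigmoid(-10))  # ~0.0: saturates for large negative x
```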
2️⃣ Tanh
Formula:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Range:
-1 to 1
Better than sigmoid because:

- Outputs are zero-centered, so gradients are better behaved

But still suffers from:

- Vanishing gradients: for large |x| the curve flattens and the gradient approaches zero
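The zero-centering can be seen directly: tanh is symmetric around the origin, so tanh(-x) = -tanh(x).

```python
import math

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(tanh(1.0))   # ~0.7616
print(tanh(-1.0))  # ~-0.7616
print(tanh(0.0))   # 0.0
```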
3️⃣ ReLU (Most Important)
Formula:
ReLU(x) = max(0, x)
Range:
0 to infinity
Why it became popular:

- Very cheap to compute (just a threshold at zero)
- No saturation for positive inputs, so gradients don't vanish there
- Produces sparse activations (many neurons output exactly 0)

Used in:

- Hidden layers of most modern networks (CNNs, MLPs, and many transformer variants)

Example:

ReLU(-3) = 0, ReLU(5) = 5

Problem:

- Dying ReLU: if a neuron's input stays negative, its output and gradient are both 0, so it stops learning
4️⃣ Leaky ReLU
Formula:
LeakyReLU(x) = x if x > 0 else 0.01x
Fixes:

The dying ReLU problem: instead of zeroing out negative inputs, it allows a small gradient (slope 0.01) when x < 0.
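A minimal side-by-side comparison in plain Python:

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Negative inputs keep a small slope instead of being zeroed out
    return x if x > 0 else slope * x

print(relu(-2.0))        # 0.0   -> gradient is also 0 (the neuron can "die")
print(leaky_relu(-2.0))  # -0.02 -> a small but non-zero signal survives
```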
5️⃣ Softmax (Output Layer)
Formula:
Softmax(x_i) = e^(x_i) / Σ e^(x_j)
Used when:

- The output layer must produce a probability distribution over multiple classes (multi-class classification)

Example:

Input logits: [2.0, 1.0, 0.1]

Output:

Probabilities ≈ [0.66, 0.24, 0.10] — always positive, always summing to 1
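The example above can be reproduced directly. Subtracting the maximum logit before exponentiating is a standard trick for numerical stability and does not change the result:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability (result is unchanged)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
# The probabilities always sum to 1 (up to floating-point rounding)
print(sum(probs))
```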
What Is Vanishing Gradient?
When training:

Gradients flow backward through the network (backpropagation), multiplied layer by layer via the chain rule.

If gradients become very small:

The weight updates in early layers shrink toward zero, and those layers effectively stop learning.

This is common in:

- Deep networks built on sigmoid or tanh, whose gradients are near zero whenever the input saturates

ReLU solved much of this: its gradient is exactly 1 for all positive inputs.
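A rough illustration of why depth makes this worse: the sigmoid's derivative is at most 0.25 (at x = 0), and the chain rule multiplies one such factor per layer. Even in this best case the gradient shrinks geometrically with depth:

```python
# Best-case sigmoid derivative is 0.25; real networks are worse,
# since most inputs are not exactly at x = 0.
max_sigmoid_grad = 0.25

for depth in [1, 5, 10, 20]:
    # Upper bound on the gradient after `depth` sigmoid layers
    print(depth, max_sigmoid_grad ** depth)
```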
Where To Use Which Activation?
Simple rule:
| Layer Type | Recommended Activation |
|------------|------------------------|
| Hidden Layers | ReLU (default choice) |
| Binary Output | Sigmoid |
| Multi-class Output | Softmax |
| Very Deep Networks | Leaky ReLU / ELU |
PyTorch Example
```python
import torch
import torch.nn as nn

class ActivationExample(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # ReLU in the hidden layer, sigmoid on the binary output
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

model = ActivationExample()
print(model)
```
Visual Intuition (Very Important)
Imagine:

A network without activations can only draw straight lines (or flat planes) through the data. Each non-linear activation lets it bend those lines, and stacking many bends lets it trace arbitrarily complex decision boundaries.

ReLU makes training:

- Faster (trivial to compute, no saturation for positive inputs)
- Sparser (many neurons output exactly 0)
- More stable in deep networks than sigmoid or tanh

That's why modern architectures (including large transformer models such as those from OpenAI) rely heavily on ReLU and its smoother variants like GELU.
Compilation of All Code Blocks (Combined)
```python
import torch
import torch.nn as nn
import math

# -------------------------
# Activation Functions
# -------------------------

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    return max(0, x)

def leaky_relu(x):
    return x if x > 0 else 0.01 * x

def softmax(x_list):
    exps = [math.exp(i) for i in x_list]
    sum_exps = sum(exps)
    return [j / sum_exps for j in exps]

# -------------------------
# PyTorch Model Example
# -------------------------

class ActivationExample(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

model = ActivationExample()
print(model)
```