Activation functions are mathematical functions used inside neural networks to decide:
“Should this neuron activate (pass information forward) or not?”
Without activation functions, a neural network would behave like a simple linear model — no matter how many layers it has.
Why Do We Need Activation Functions?
Recall the neuron formula:
y = f(Wx + b)
If we remove f( ) (the activation function), the network becomes:
y = Wx + b
Even with many layers, this remains just a linear transformation.
- Real-world data is non-linear
- Activation functions introduce non-linearity between layers

This allows deep networks to:

- Learn complex, non-linear patterns (images, language, audio)
- Build hierarchical features layer by layer instead of collapsing into one linear map
- Approximate a very broad class of functions
How Activation Functions Work

Step-by-step inside a neuron:

1. Compute the weighted sum: z = Wx + b
2. Apply the activation function: a = f(z)
3. Pass the result a forward to the next layer
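The steps above can be sketched in plain Python (the weights, bias, and input here are made-up illustrative values):

```python
# Made-up weights, bias, and input for illustration
W = [0.5, -0.2, 0.1]
x = [1.0, 2.0, 3.0]
b = 0.4

# Step 1: weighted sum z = Wx + b
z = sum(w_i * x_i for w_i, x_i in zip(W, x)) + b

# Step 2: apply an activation function (ReLU here)
a = max(0.0, z)

# Step 3: 'a' is what gets passed to the next layer
print(z, a)
```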
Common Activation Functions
The table below summarizes the most common activation functions:
| Activation | Formula | Range | Used In | Pros | Cons |
|------------|---------|-------|---------|------|------|
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Binary classification (output) | Smooth, probabilistic | Vanishing gradient |
| Tanh | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | Hidden layers (older networks) | Zero-centered | Vanishing gradient |
| ReLU | max(0, x) | [0, ∞) | Hidden layers (modern default) | Fast, simple | Dying ReLU |
| Leaky ReLU | max(0.01x, x) | (-∞, ∞) | Hidden layers | Fixes dying ReLU | Slightly more complex |
| ELU | x if x > 0 else α(e^x − 1) | (-α, ∞) | Deep networks | Smooth, better learning | Slower |
| Softmax | e^xi / Σ e^xj | (0, 1), sums to 1 | Multi-class output | Probabilities | Output layer only |
1️⃣ Sigmoid
Formula:
sigmoid(x) = 1 / (1 + e^(-x))
Output range:
0 to 1
Used for:

- Binary classification output layers (the output can be read as a probability)
- Gating mechanisms (e.g. inside LSTMs)

Problem:

- Saturates for large |x|, causing vanishing gradients
- Outputs are not zero-centered, which can slow training
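A quick numeric check of the formula, showing the midpoint and the saturation at both ends:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5: midpoint of the curve
print(sigmoid(10))   # ~1.0: saturates for large positive x
print(sigmoid(-10))  # ~0.0: saturates for large negative x
```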
2️⃣ Tanh
Formula:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Range:
-1 to 1
Better than sigmoid because:

- Outputs are zero-centered, so gradients are better behaved

But still suffers from:

- Vanishing gradients: for large |x| the curve flattens and the gradient approaches zero
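The zero-centering can be seen directly: tanh is symmetric around the origin, so tanh(-x) = -tanh(x).

```python
import math

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(tanh(1.0))   # ~0.7616
print(tanh(-1.0))  # ~-0.7616
print(tanh(0.0))   # 0.0
```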
3️⃣ ReLU (Most Important)
Formula:
ReLU(x) = max(0, x)
Range:
0 to infinity
Why it became popular:

- Very cheap to compute (just a threshold at zero)
- No saturation for positive inputs, so gradients don't vanish there
- Produces sparse activations (many neurons output exactly 0)

Used in:

- Hidden layers of most modern networks (CNNs, MLPs, and many transformer variants)

Example:

ReLU(-3) = 0, ReLU(5) = 5

Problem:

- Dying ReLU: if a neuron's input stays negative, its output and gradient are both 0, so it stops learning
4️⃣ Leaky ReLU
Formula:
LeakyReLU(x) = x if x > 0 else 0.01x
Fixes:

The dying ReLU problem: instead of zeroing out negative inputs, it allows a small gradient (slope 0.01) when x < 0.
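A minimal side-by-side comparison in plain Python:

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Negative inputs keep a small slope instead of being zeroed out
    return x if x > 0 else slope * x

print(relu(-2.0))        # 0.0   -> gradient is also 0 (the neuron can "die")
print(leaky_relu(-2.0))  # -0.02 -> a small but non-zero signal survives
```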
5️⃣ Softmax (Output Layer)
Formula:
Softmax(x_i) = e^(x_i) / Σ e^(x_j)
Used when:

- The output layer must produce a probability distribution over multiple classes (multi-class classification)

Example:

Input logits: [2.0, 1.0, 0.1]

Output:

Probabilities ≈ [0.66, 0.24, 0.10] — always positive, always summing to 1
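The example above can be reproduced directly. Subtracting the maximum logit before exponentiating is a standard trick for numerical stability and does not change the result:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability (result is unchanged)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
# The probabilities always sum to 1 (up to floating-point rounding)
print(sum(probs))
```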
What Is Vanishing Gradient?
When training:

Gradients flow backward through the network (backpropagation), multiplied layer by layer via the chain rule.

If gradients become very small:

The weight updates in early layers shrink toward zero, and those layers effectively stop learning.

This is common in:

- Deep networks built on sigmoid or tanh, whose gradients are near zero whenever the input saturates

ReLU solved much of this: its gradient is exactly 1 for all positive inputs.
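A rough illustration of why depth makes this worse: the sigmoid's derivative is at most 0.25 (at x = 0), and the chain rule multiplies one such factor per layer. Even in this best case the gradient shrinks geometrically with depth:

```python
# Best-case sigmoid derivative is 0.25; real networks are worse,
# since most inputs are not exactly at x = 0.
max_sigmoid_grad = 0.25

for depth in [1, 5, 10, 20]:
    # Upper bound on the gradient after `depth` sigmoid layers
    print(depth, max_sigmoid_grad ** depth)
```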
Where To Use Which Activation?
Simple rule:
| Layer Type | Recommended Activation |
|------------|------------------------|
| Hidden Layers | ReLU (default choice) |
| Binary Output | Sigmoid |
| Multi-class Output | Softmax |
| Very Deep Networks | Leaky ReLU / ELU |
PyTorch Example
```python
import torch
import torch.nn as nn

class ActivationExample(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # ReLU in the hidden layer, sigmoid on the binary output
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

model = ActivationExample()
print(model)
```
Visual Intuition (Very Important)
Imagine:

A network without activations can only draw straight lines (or flat planes) through the data. Each non-linear activation lets it bend those lines, and stacking many bends lets it trace arbitrarily complex decision boundaries.

ReLU makes training:

- Faster (trivial to compute, no saturation for positive inputs)
- Sparser (many neurons output exactly 0)
- More stable in deep networks than sigmoid or tanh

That's why modern architectures (including large transformer models such as those from OpenAI) rely heavily on ReLU and its smoother variants like GELU.
Compilation of All Code Blocks (Combined)
```python
import torch
import torch.nn as nn
import math

# -------------------------
# Activation Functions
# -------------------------

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    return max(0, x)

def leaky_relu(x):
    return x if x > 0 else 0.01 * x

def softmax(x_list):
    exps = [math.exp(i) for i in x_list]
    sum_exps = sum(exps)
    return [j / sum_exps for j in exps]

# -------------------------
# PyTorch Model Example
# -------------------------

class ActivationExample(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

model = ActivationExample()
print(model)
```