Activation Functions | Deep Learning Tutorial - Learn with VOKS

Activation Functions


Activation functions are mathematical functions used inside neural networks to decide:

“Should this neuron activate (pass information forward) or not?”

Without activation functions, a neural network would behave like a simple linear model — no matter how many layers it has.


Why Do We Need Activation Functions?

Recall the neuron formula:


y = f(Wx + b)

If we remove f( ) (the activation function), the network becomes:


y = Wx + b

Even with many layers, this remains just a linear transformation.
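This collapse can be checked numerically. A minimal sketch (the weights and inputs below are arbitrary illustrative values, not from the text):

```python
import numpy as np

# Two stacked linear layers with no activation in between
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
b1 = np.array([0.5, -0.5])
W2 = np.array([[2.0, 0.0], [1.0, 1.0]])
b2 = np.array([1.0, 0.0])

x = np.array([1.0, -1.0])

# Applying the layers one after the other
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping, folded into a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, the composition is always just another `W @ x + b`.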

Real-world data is non-linear; activation functions introduce the non-linearity a network needs to model it.

This allows deep networks to:

  • Recognize images
  • Understand speech
  • Translate languages
  • Detect fraud

How Activation Functions Work

Step-by-step inside a neuron:

  1. Multiply inputs by weights
  2. Add bias
  3. Apply activation function
  4. Pass output to next layer
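The four steps above can be sketched in a few lines of plain Python (the inputs, weights, and bias are toy values chosen for illustration):

```python
import math

inputs  = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias    = 0.1

# Steps 1-2: multiply inputs by weights, add bias
z = sum(w * x for w, x in zip(weights, inputs)) + bias

# Step 3: apply an activation function (sigmoid here)
y = 1 / (1 + math.exp(-z))

# Step 4: y would be passed to the next layer
print(z, y)  # z = -0.4, y ≈ 0.401
```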

Common Activation Functions

The table below compares the most common activation functions:

| Activation | Formula | Range | Used In | Pros | Cons |
|------------|----------|--------|----------|------|------|
| Sigmoid | 1 / (1 + e^-x) | (0,1) | Binary classification (output) | Smooth, probabilistic | Vanishing gradient |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | (-1,1) | Hidden layers (older networks) | Zero-centered | Vanishing gradient |
| ReLU | max(0,x) | [0,∞) | Hidden layers (modern default) | Fast, simple | Dying ReLU |
| Leaky ReLU | max(0.01x, x) | (-∞,∞) | Hidden layers | Fixes dying ReLU | Slight complexity |
| ELU | x if x>0 else α(e^x−1) | (-α,∞) | Deep networks | Smooth, better learning | Slower |
| Softmax | e^xi / Σe^xj | (0,1) sum=1 | Multi-class output | Probabilities | Only for output layer |

1️⃣ Sigmoid

Formula:

sigmoid(x) = 1 / (1 + e^(-x))

Output range:

0 to 1

Used for:

  • Binary classification output layer

Problem:

  • Vanishing gradient (gradients become very small)
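A small sketch makes the vanishing-gradient problem concrete: the derivative of sigmoid is `s * (1 - s)`, which peaks at 0.25 and shrinks toward zero for large inputs.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid(0))              # 0.5
print(sigmoid_derivative(0))   # 0.25 (the maximum possible)
print(sigmoid_derivative(10))  # ~4.5e-05, gradients vanish for large |x|
```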

2️⃣ Tanh

Formula:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Range:

-1 to 1

Better than sigmoid because:

  • Zero-centered

But still suffers from:

  • Vanishing gradient
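Implementing tanh from its definition and comparing it to Python's built-in `math.tanh` is a quick sanity check:

```python
import math

def tanh(x):
    # Direct translation of the formula above
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(tanh(0))                              # 0.0 -> zero-centered
print(round(tanh(2), 4))                    # 0.964
print(math.isclose(tanh(2), math.tanh(2)))  # True
```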

3️⃣ ReLU (Most Important)

Formula:

ReLU(x) = max(0, x)

Range:

0 to infinity

Why it became popular:

  • Fast computation
  • Prevents vanishing gradient (mostly)
  • Works well in deep networks

Used in:

  • CNNs
  • Transformers
  • Most modern models

Example:

  • AlexNet (2012) popularized ReLU in deep vision networks.

Problem:

  • Dying ReLU (neurons stuck at 0)
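The dying-ReLU problem is easy to see in a sketch: every negative input maps to exactly zero, so a neuron whose pre-activations are always negative outputs nothing and receives zero gradient.

```python
def relu(x):
    return max(0.0, x)

print(relu(3.5))   # 3.5
print(relu(-2.0))  # 0.0

# A neuron that only ever sees negative inputs is "dead":
# its output (and its gradient) is always zero, so it never updates.
print([relu(x) for x in [-3, -1, -0.5]])  # [0.0, 0.0, 0.0]
```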

4️⃣ Leaky ReLU

Formula:


LeakyReLU(x) = x if x > 0 else 0.01x

Fixes:

  • Dying ReLU

Allows small gradient when x < 0.
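A minimal sketch (0.01 is the conventional slope; some libraries expose it as a tunable `alpha` parameter):

```python
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

print(leaky_relu(5.0))   # 5.0
print(leaky_relu(-5.0))  # -0.05, small but nonzero, so the gradient survives
```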


5️⃣ Softmax (Output Layer)

Formula:


Softmax(x_i) = e^(x_i) / Σ e^(x_j)

Used when:

  • Multi-class classification

Example:

  • Digit recognition (0–9)
  • Image classification

Output:

  • All values between 0 and 1
  • Sum = 1 (probabilities)
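Both properties can be verified with a short sketch (the max-subtraction step is a standard numerical-stability trick, not part of the formula above; the input scores are arbitrary):

```python
import math

def softmax(xs):
    # Subtracting the max does not change the result
    # but prevents overflow for large inputs.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0
```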

What Is Vanishing Gradient?

When training:

  • We compute gradients
  • Gradients update weights

If gradients become very small:

  • Learning slows
  • Deep layers stop learning

This is common in:

  • Sigmoid
  • Tanh

ReLU solved much of this.
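The effect compounds with depth. Backpropagation multiplies one derivative per layer (the chain rule), so even at sigmoid's best point (x = 0, derivative 0.25) the gradient shrinks geometrically, as this sketch shows:

```python
import math

def sigmoid_derivative(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

grad = 1.0
for layer in range(10):
    # One factor per layer, at the sigmoid's steepest point
    grad *= sigmoid_derivative(0.0)

print(grad)  # 0.25 ** 10 ≈ 9.5e-07
```

After just ten sigmoid layers, the gradient reaching the earliest layer is about a millionth of its original size.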


Where To Use Which Activation?

Simple rule:


| Layer Type | Recommended Activation |
|------------|------------------------|
| Hidden Layers | ReLU (default choice) |
| Binary Output | Sigmoid |
| Multi-class Output | Softmax |
| Very Deep Networks | Leaky ReLU / ELU |

PyTorch Example

import torch
import torch.nn as nn

class ActivationExample(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

model = ActivationExample()
print(model)

Visual Intuition (Very Important)

Imagine:

  • Sigmoid → Smooth S-curve
  • Tanh → Bigger S-curve centered at 0
  • ReLU → Flat for negatives, linear for positives

ReLU makes training:

  • Faster
  • More stable
  • Scalable to deep models

That’s why modern architectures, including transformers, rely heavily on ReLU and its smoother variants (such as GELU).


Reference Implementations (Plain Python)

The PyTorch example above uses built-in modules; the functions below implement the same activations from scratch:

import math

# -------------------------
# Activation Functions
# -------------------------

def sigmoid(x):
    # Squashes any real number into (0, 1)
    return 1 / (1 + math.exp(-x))

def tanh(x):
    # Zero-centered squashing into (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    # Zero for negatives, identity for positives
    return max(0, x)

def leaky_relu(x):
    # Small slope (0.01) for negatives avoids dying ReLU
    return x if x > 0 else 0.01 * x

def softmax(x_list):
    # Converts a list of scores into probabilities summing to 1
    exps = [math.exp(i) for i in x_list]
    sum_exps = sum(exps)
    return [j / sum_exps for j in exps]