A Practical Guide to Neural Network Activation Functions (with Code + Intuition)
Activation functions are the non‑linear heart of neural networks. Without them, a network would collapse into a simple linear transformation, no matter how many layers you stack. In this post, we’ll walk through a complete set of activation functions implemented in NumPy, explain what each one does, and discuss when and why you’d choose it.
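To make that concrete, here is a tiny sketch (with made-up weight matrices, not from the original file) showing that two stacked linear layers collapse into a single linear map:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input vector
W1 = rng.normal(size=(3, 4))       # first "layer"
W2 = rng.normal(size=(2, 3))       # second "layer"

# Two linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)

# ...are exactly one linear layer with a merged weight matrix.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True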
All of the examples below come from a single NumPy file:
import numpy as np

def _clip(z, min_val=-500, max_val=500):
    return np.clip(z, min_val, max_val)
The _clip helper prevents numerical overflow in functions like sigmoid and softplus.
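As a quick sanity check (not from the original file), here is what clipping buys you on an extreme input, using the _clip helper above:

import numpy as np

z = np.array([-1000.0, 0.0, 1000.0])

# Without clipping, np.exp(-z) overflows to inf at z = -1000 and NumPy
# emits a RuntimeWarning (the result is still usable, but noisy).
raw = 1 / (1 + np.exp(-z))

# With _clip, every intermediate value stays finite.
safe = 1 / (1 + np.exp(-_clip(z)))
print(safe)  # roughly [0, 0.5, 1], with no overflow warning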
1. Linear (Identity)
def linear(z):
    return z

def linear_deriv(z):
    return np.ones_like(z)
What it does
Linear activation returns the input unchanged.
When to use it
- Output layer of regression models (predicting continuous values like price, temperature, etc.)
- Hidden layers almost never use it, because it adds no nonlinearity.
Why it matters
It keeps the output unbounded, which is exactly what you want for real‑valued predictions.
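As a minimal sketch (with hypothetical weights W and bias b), a regression output layer is just an affine map passed through the identity:

import numpy as np

rng = np.random.default_rng(1)
hidden = rng.normal(size=(8,))   # activations from the last hidden layer
W = rng.normal(size=(1, 8))      # hypothetical output weights
b = np.zeros(1)                  # hypothetical output bias

# The activation is the identity, so the prediction stays unbounded.
prediction = linear(W @ hidden + b)
print(prediction)                # a single real-valued output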
2. Sigmoid
def sigmoid(z):
    z = _clip(z)
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)
What it does
Squashes values into the range (0, 1).
One of the earliest activation functions used in neural nets.
When to use it
- Binary classification output layer (predicting the probability of class 1)
- Rarely used in hidden layers today.
Why it matters
- Outputs a probability-like value.
- But suffers from vanishing gradients for large |z|.
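A quick numerical check with sigmoid_deriv (defined above) makes the vanishing-gradient point concrete:

import numpy as np

z = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_deriv(z))
# approx [0.25, 0.105, 0.0066, 0.000045] -- the gradient collapses as |z| grows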
3. Tanh
def tanh(z):
    return np.tanh(z)

def tanh_deriv(z):
    t = np.tanh(z)
    return 1 - t**2
What it does
Maps values to (-1, 1) instead of (0, 1).
When to use it
- Hidden layers in older architectures.
- Situations where centered activations help optimization.
Why it matters
Tanh is “zero-centered,” which often trains better than sigmoid —
but it still suffers from vanishing gradients.
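A small sketch (reusing the functions above) shows both points at once: tanh outputs average out near zero, and its gradient at the origin is four times larger than sigmoid's.

import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=10_000)

print(np.mean(sigmoid(z)))                  # about 0.5 -- always positive
print(np.mean(tanh(z)))                     # about 0.0 -- zero-centered
print(sigmoid_deriv(0.0), tanh_deriv(0.0))  # 0.25 vs 1.0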
4. ReLU (Rectified Linear Unit)
def relu(z):
    return np.maximum(0, z)

def relu_deriv(z):
    return (z > 0).astype(float)
What it does
Outputs:
- z if z > 0
- 0 otherwise
When to use it
- Default choice for most modern neural networks
- Deep CNNs, MLPs, transformers (with variants)
Why it matters
- Simple
- Fast
- Does not saturate for positive values
- Helps avoid vanishing gradients
The downside: neurons can “die” if they get stuck with negative inputs.
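Here is a minimal illustration of a "dead" unit (with hypothetical pre-activations): once every input lands in the negative region, both the output and the gradient are exactly zero, so the weights never move again.

import numpy as np

pre_activations = np.array([-3.2, -0.7, -1.5, -4.1])  # a stuck neuron

print(relu(pre_activations))        # [0. 0. 0. 0.] -- output is always zero
print(relu_deriv(pre_activations))  # [0. 0. 0. 0.] -- so no gradient flows back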
5. Leaky ReLU
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_deriv(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)
What it does
Like ReLU, but negative values are allowed to leak through slightly.
When to use it
- When you want ReLU’s benefits but want to avoid “dead neurons”
- Good for deep networks with sparse activations
Why it matters
It fixes ReLU’s biggest flaw while keeping the same simplicity.
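Comparing the two derivatives on negative inputs (a quick check using the functions above) shows the fix: Leaky ReLU keeps a small, nonzero gradient where ReLU has none, so a stuck neuron can still recover.

import numpy as np

z = np.array([-3.0, -1.0, 0.5, 2.0])

print(relu_deriv(z))        # [0.   0.   1.   1.  ]
print(leaky_relu_deriv(z))  # [0.01 0.01 1.   1.  ]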
6. ELU (Exponential Linear Unit)
def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

def elu_deriv(z, alpha=1.0):
    return np.where(z >= 0, 1.0, alpha * np.exp(z))
What it does
Smoothly blends exponential behavior for negative values with identity for positive values.
When to use it
- Deep networks where smooth gradients help optimization
- Tasks where ReLU is too harsh on negative values
Why it matters
- Avoids dead neurons
- Produces smoother gradients
- Often trains faster than ReLU on some tasks
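A small numerical sketch with elu (as defined above): the negative branch bends smoothly through zero and saturates at -alpha instead of being chopped to zero.

import numpy as np

z = np.array([-10.0, -1.0, -0.1, 0.0, 0.1, 1.0])
print(elu(z))
# approx [-1.0, -0.63, -0.095, 0.0, 0.1, 1.0] -- smooth through zero, floor at -alpha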
7. Softplus
def softplus(z):
    z = _clip(z)
    return np.log1p(np.exp(z))

def softplus_deriv(z):
    return sigmoid(z)
What it does
A smooth approximation of ReLU:
\[ \text{softplus}(z) = \log(1 + e^z) \]
When to use it
- When you want ReLU-like behavior but with smooth derivatives
- Useful in probabilistic models (e.g., variance parameters)
Why it matters
- Always differentiable
- No dead neurons
- But slower than ReLU
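Two quick checks (a sketch using the functions above): softplus tracks ReLU closely away from zero, and a finite-difference estimate of its slope matches sigmoid, which is exactly what softplus_deriv returns.

import numpy as np

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(relu(z))      # [0.    0.    0.    1.    6.   ]
print(softplus(z))  # approx [0.0025, 0.313, 0.693, 1.313, 6.0025]

# Finite-difference check that d/dz softplus(z) == sigmoid(z)
eps = 1e-6
numeric = (softplus(z + eps) - softplus(z - eps)) / (2 * eps)
print(np.allclose(numeric, softplus_deriv(z), atol=1e-5))  # True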
8. Softmax (for Multi-Class Classification)
def softmax(z):
    # Operates on a single vector of logits.
    shift = z - np.max(z)          # subtract the max for numerical stability
    exp_vals = np.exp(shift)
    return exp_vals / np.sum(exp_vals)

def softmax_deriv(z):
    s = softmax(z)
    # Full Jacobian: J[i, j] = s[i] * (delta_ij - s[j])
    return np.diag(s) - np.outer(s, s)
What it does
Turns a vector of logits into a probability distribution.
Example:
z = [2.0, 1.0, 0.1]
softmax(z) → [0.66, 0.24, 0.10]
When to use it
- Output layer of multi-class classification
- Paired with cross-entropy loss
Why it matters
- Outputs valid probabilities
- Highlights the largest logit
- The derivative is a Jacobian, but in practice the loss simplifies it
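To make that last point concrete, here is a sketch (with a made-up one-hot target y and a hypothetical cross_entropy helper) showing that the gradient of cross-entropy with respect to the logits reduces to softmax(z) - y; a finite-difference estimate agrees.

import numpy as np

z = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([1.0, 0.0, 0.0])   # hypothetical one-hot target

def cross_entropy(logits, target):
    return -np.sum(target * np.log(softmax(logits)))

# Analytic gradient of the loss w.r.t. the logits: softmax(z) - y
analytic = softmax(z) - y

# Central finite-difference gradient for comparison
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y) -
     cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True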