A Practical Guide to Neural Network Activation Functions (with Code + Intuition)

Activation functions are the non‑linear heart of neural networks. Without them, a network would collapse into a simple linear transformation, no matter how many layers you stack. In this post, we’ll walk through a complete set of activation functions implemented in NumPy, explain what each one does, and discuss when and why you’d choose it.
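
To see why, here is a quick sketch (with made-up weight matrices) showing that two stacked linear layers collapse into a single matrix multiply:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))     # input vector
W1 = rng.normal(size=(3, 4))  # "layer 1" weights
W2 = rng.normal(size=(2, 3))  # "layer 2" weights

two_layers = W2 @ (W1 @ x)    # two linear layers stacked, no activation in between
one_layer = (W2 @ W1) @ x     # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True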

All of the examples below are plain NumPy. They share one small helper:

import numpy as np

def _clip(z, min_val=-500, max_val=500):
    return np.clip(z, min_val, max_val)

The _clip helper prevents numerical overflow in functions like sigmoid and softplus.
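
A quick illustration of why the clipping matters: without it, np.exp(-z) overflows to inf for very negative z and NumPy emits a RuntimeWarning; with it, the argument stays in [-500, 500] and every result is finite.

z = np.array([-1000.0, 0.0, 1000.0])
print(1 / (1 + np.exp(-_clip(z))))  # ~[0.  0.5  1.] with no overflow warning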


1. Linear (Identity)

def linear(z):
    return z

def linear_deriv(z):
    return np.ones_like(z)

What it does

Linear activation returns the input unchanged.

When to use it

  • Output layer of regression models
    (predicting continuous values like price, temperature, etc.)
  • Hidden layers almost never use it — it adds no nonlinearity.

Why it matters

It keeps the output unbounded, which is exactly what you want for real‑valued predictions.
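
As a sanity check you can reuse throughout this post, here is a tiny helper (check_deriv is just a name introduced here, not part of NumPy) that compares an analytic *_deriv function against a central finite difference:

def check_deriv(fn, deriv_fn, z, eps=1e-6):
    # Numerical derivative: (f(z + eps) - f(z - eps)) / (2 * eps)
    # Note: avoid evaluating exactly at kinks (e.g. z = 0 for ReLU),
    # where the finite difference and the analytic derivative disagree.
    numeric = (fn(z + eps) - fn(z - eps)) / (2 * eps)
    return np.allclose(deriv_fn(z), numeric, atol=1e-4)

z = np.linspace(-3.0, 3.0, 7)
print(check_deriv(linear, linear_deriv, z))  # True: the slope is 1 everywhere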


2. Sigmoid

def sigmoid(z):
    z = _clip(z)
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

What it does

Squashes values into the range (0, 1).
One of the earliest activation functions used in neural nets.

When to use it

  • Binary classification output layer
    (predicting probability of class 1)
  • Rarely used in hidden layers today.

Why it matters

  • Outputs a probability-like value.
  • But suffers from vanishing gradients for large |z|.
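
The vanishing-gradient problem is easy to see numerically: the derivative peaks at 0.25 and collapses toward zero as |z| grows.

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(sigmoid_deriv(z))  # ~[4.5e-05  6.6e-03  2.5e-01  6.6e-03  4.5e-05]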

3. Tanh

def tanh(z):
    return np.tanh(z)

def tanh_deriv(z):
    t = np.tanh(z)
    return 1 - t**2

What it does

Maps values to (-1, 1) instead of (0, 1).

When to use it

  • Hidden layers in older architectures.
  • Situations where centered activations help optimization.

Why it matters

Tanh is “zero-centered,” which often makes optimization easier than with sigmoid,
but it still suffers from vanishing gradients for large |z|.
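
A quick check of what “zero-centered” means in practice: for inputs symmetric around 0, tanh activations average out near 0, while sigmoid activations average near 0.5, which shifts the inputs seen by the next layer.

z = np.linspace(-3.0, 3.0, 101)  # inputs symmetric around zero
print(np.mean(tanh(z)))          # ~0.0
print(np.mean(sigmoid(z)))       # ~0.5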


4. ReLU (Rectified Linear Unit)

def relu(z):
    return np.maximum(0, z)

def relu_deriv(z):
    return (z > 0).astype(float)

What it does

Outputs:

  • z if z > 0
  • 0 otherwise

When to use it

  • Default choice for most modern neural networks
  • Deep CNNs, MLPs, transformers (with variants)

Why it matters

  • Simple
  • Fast
  • Does not saturate for positive values
  • Helps avoid vanishing gradients

The downside: neurons can “die” if their inputs stay negative, at which point they output zero and stop learning.
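
A minimal illustration of a “dead” neuron: once its pre-activations are all negative, both the output and the gradient are exactly zero, so gradient descent has nothing to push on.

z = np.array([-3.0, -1.5, -0.2])  # a neuron that only sees negative inputs
print(relu(z))                    # [0. 0. 0.]  -> the neuron always outputs 0
print(relu_deriv(z))              # [0. 0. 0.]  -> and no gradient flows back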


5. Leaky ReLU

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_deriv(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

What it does

Like ReLU, but negative values are allowed to leak through slightly.

When to use it

  • When you want ReLU’s benefits but want to avoid “dead neurons”
  • Good for deep networks with sparse activations

Why it matters

It fixes ReLU’s biggest flaw while keeping the same simplicity.
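
The same negative inputs that killed the ReLU neuron above still produce a small output and a nonzero gradient here, so the neuron can recover:

z = np.array([-3.0, -1.5, -0.2])
print(leaky_relu(z))        # [-0.03  -0.015  -0.002]
print(leaky_relu_deriv(z))  # [0.01  0.01  0.01]  -> gradient is alpha, never zero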


6. ELU (Exponential Linear Unit)

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

def elu_deriv(z, alpha=1.0):
    return np.where(z >= 0, 1, alpha * np.exp(z))

What it does

Smoothly blends exponential behavior for negative values with identity for positive values.

When to use it

  • Deep networks where smooth gradients help optimization
  • Tasks where ReLU is too harsh on negative values

Why it matters

  • Avoids dead neurons
  • Produces smoother gradients
  • Often trains faster than ReLU on some tasks
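
A quick look at how ELU treats negative inputs: instead of cutting them to zero or leaking them linearly, it saturates smoothly toward -alpha, and its gradient fades gradually rather than dropping to zero.

z = np.array([-5.0, -1.0, 0.0, 1.0])
print(elu(z))        # ~[-0.993  -0.632  0.  1.]  -> saturates near -alpha = -1
print(elu_deriv(z))  # ~[ 0.007   0.368  1.  1.]  -> no hard cutoff in the gradient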

7. Softplus

def softplus(z):
    z = _clip(z)
    return np.log1p(np.exp(z))

def softplus_deriv(z):
    return sigmoid(z)

What it does

A smooth approximation of ReLU:

softplus(z) = log(1 + e^z)

When to use it

  • When you want ReLU-like behavior but with smooth derivatives
  • Useful in probabilistic models (e.g., variance parameters)

Why it matters

  • Always differentiable
  • No dead neurons
  • But slower than ReLU
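
Two quick checks tying softplus back to earlier sections: it tracks ReLU closely away from zero, and a finite-difference estimate confirms that its slope really is the sigmoid.

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(z))      # [0.  0.  0.  1.  5.]
print(softplus(z))  # ~[0.007  0.313  0.693  1.313  5.007]

# Finite-difference check that d/dz softplus(z) = sigmoid(z):
eps = 1e-6
numeric = (softplus(z + eps) - softplus(z - eps)) / (2 * eps)
print(np.allclose(numeric, sigmoid(z), atol=1e-4))  # True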

8. Softmax (for Multi-Class Classification)

def softmax(z):
    shift = z - np.max(z)  # subtract the max for numerical stability
    exp_vals = np.exp(shift)
    return exp_vals / np.sum(exp_vals)

def softmax_deriv(z):
    # Full Jacobian: J[i, j] = s_i * (delta_ij - s_j)
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

What it does

Turns a vector of logits into a probability distribution.

Example:

z = [2.0, 1.0, 0.1]
softmax(z) → [0.66, 0.24, 0.10]

When to use it

  • Output layer of multi-class classification
  • Paired with cross-entropy loss

Why it matters

  • Outputs valid probabilities
  • Highlights the largest logit
  • The derivative is a Jacobian, but in practice the loss simplifies it
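
To make the last point concrete, here is a small check (the one-hot target y is made up for illustration) that with cross-entropy loss the gradient with respect to the logits collapses to softmax(z) - y, so you rarely need the full Jacobian:

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])  # illustrative one-hot target: true class is index 0

s = softmax(z)
grad_logits = s - y            # gradient of cross-entropy w.r.t. the logits
print(grad_logits)             # ~[-0.34  0.24  0.10]

# Numerical check against the loss L(z) = -sum(y * log(softmax(z))):
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    loss_plus = -np.sum(y * np.log(softmax(z + dz)))
    loss_minus = -np.sum(y * np.log(softmax(z - dz)))
    numeric[i] = (loss_plus - loss_minus) / (2 * eps)
print(np.allclose(grad_logits, numeric, atol=1e-4))  # True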
