A Practical Guide to Neural Network Activation Functions (with Code + Intuition)
Activation functions are the non‑linear heart of neural networks. Without them, a network would collapse into a simple linear transformation, no matter how many layers you stack. In this post, we’ll walk through a complete set of activation functions implemented in NumPy, explain what each one does, and discuss when and why you’d choose it.
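To make that concrete, here is a tiny sketch (with made-up weight matrices, not from the original file) showing that two stacked linear layers collapse into a single linear map:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input vector
W1 = rng.normal(size=(3, 4))       # first "layer"
W2 = rng.normal(size=(2, 3))       # second "layer"

# Two linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)

# ...are exactly one linear layer with a merged weight matrix.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True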
All of the examples below come from a single NumPy file:
import numpy as np

def _clip(z, min_val=-500, max_val=500):
    return np.clip(z, min_val, max_val)
The _clip helper prevents numerical overflow in functions like sigmoid and softplus.
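As a quick sanity check (not from the original file), here is what clipping buys you on an extreme input, using the _clip helper above:

import numpy as np

z = np.array([-1000.0, 0.0, 1000.0])

# Without clipping, np.exp(-z) overflows to inf at z = -1000 and NumPy
# emits a RuntimeWarning (the result is still usable, but noisy).
raw = 1 / (1 + np.exp(-z))

# With _clip, every intermediate value stays finite.
safe = 1 / (1 + np.exp(-_clip(z)))
print(safe)  # roughly [0, 0.5, 1], with no overflow warning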
1. Linear (Identity)
def linear(z):
    return z

def linear_deriv(z):
    return np.ones_like(z)
What it does
Linear activation returns the input unchanged.
When to use it
- Output layer of regression models (predicting continuous values like price, temperature, etc.)
- Hidden layers almost never use it, because it adds no nonlinearity.
Why it matters
It keeps the output unbounded, which is exactly what you want for real‑valued predictions.
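As a minimal sketch (with hypothetical weights W and bias b), a regression output layer is just an affine map passed through the identity:

import numpy as np

rng = np.random.default_rng(1)
hidden = rng.normal(size=(8,))   # activations from the last hidden layer
W = rng.normal(size=(1, 8))      # hypothetical output weights
b = np.zeros(1)                  # hypothetical output bias

# The activation is the identity, so the prediction stays unbounded.
prediction = linear(W @ hidden + b)
print(prediction)                # a single real-valued output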
2. Sigmoid
def sigmoid(z):
    z = _clip(z)
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)
What it does
Squashes values into the range (0, 1).
One of the earliest activation functions used in neural nets.
When to use it
- Binary classification output layer (predicting the probability of class 1)
- Rarely used in hidden layers today.
Why it matters
- Outputs a probability-like value.
- But suffers from vanishing gradients for large |z|.
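A quick numerical check with sigmoid_deriv (defined above) makes the vanishing-gradient point concrete:

import numpy as np

z = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_deriv(z))
# approx [0.25, 0.105, 0.0066, 0.000045] -- the gradient collapses as |z| grows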
3. Tanh
def tanh(z):
    return np.tanh(z)

def tanh_deriv(z):
    t = np.tanh(z)
    return 1 - t**2
What it does
Maps values to (-1, 1) instead of (0, 1).
When to use it
- Hidden layers in older architectures.
- Situations where centered activations help optimization.
Why it matters
Tanh is “zero-centered,” which often trains better than sigmoid —
but it still suffers from vanishing gradients.
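A small sketch (reusing the functions above) shows both points at once: tanh outputs average out near zero, and its gradient at the origin is four times larger than sigmoid's.

import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=10_000)

print(np.mean(sigmoid(z)))                  # about 0.5 -- always positive
print(np.mean(tanh(z)))                     # about 0.0 -- zero-centered
print(sigmoid_deriv(0.0), tanh_deriv(0.0))  # 0.25 vs 1.0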
4. ReLU (Rectified Linear Unit)
def relu(z):
    return np.maximum(0, z)

def relu_deriv(z):
    return (z > 0).astype(float)
What it does
Outputs:
- z if z > 0
- 0 otherwise
When to use it
- Default choice for most modern neural networks
- Deep CNNs, MLPs, transformers (with variants)
Why it matters
- Simple
- Fast
- Does not saturate for positive values
- Helps avoid vanishing gradients
The downside: neurons can “die” if they get stuck with negative inputs.
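Here is a minimal illustration of a "dead" unit (with hypothetical pre-activations): once every input lands in the negative region, both the output and the gradient are exactly zero, so the weights never move again.

import numpy as np

pre_activations = np.array([-3.2, -0.7, -1.5, -4.1])  # a stuck neuron

print(relu(pre_activations))        # [0. 0. 0. 0.] -- output is always zero
print(relu_deriv(pre_activations))  # [0. 0. 0. 0.] -- so no gradient flows back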
5. Leaky ReLU
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_deriv(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)
What it does
Like ReLU, but negative values are allowed to leak through slightly.
When to use it
- When you want ReLU’s benefits but want to avoid “dead neurons”
- Good for deep networks with sparse activations
Why it matters
It fixes ReLU’s biggest flaw while keeping the same simplicity.
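Comparing the two derivatives on negative inputs (a quick check using the functions above) shows the fix: Leaky ReLU keeps a small, nonzero gradient where ReLU has none, so a stuck neuron can still recover.

import numpy as np

z = np.array([-3.0, -1.0, 0.5, 2.0])

print(relu_deriv(z))        # [0.   0.   1.   1.  ]
print(leaky_relu_deriv(z))  # [0.01 0.01 1.   1.  ]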
6. ELU (Exponential Linear Unit)
def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

def elu_deriv(z, alpha=1.0):
    return np.where(z >= 0, 1.0, alpha * np.exp(z))
What it does
Smoothly blends exponential behavior for negative values with identity for positive values.
When to use it
- Deep networks where smooth gradients help optimization
- Tasks where ReLU is too harsh on negative values
Why it matters
- Avoids dead neurons
- Produces smoother gradients
- Often trains faster than ReLU on some tasks
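A small numerical sketch with elu (as defined above): the negative branch bends smoothly through zero and saturates at -alpha instead of being chopped to zero.

import numpy as np

z = np.array([-10.0, -1.0, -0.1, 0.0, 0.1, 1.0])
print(elu(z))
# approx [-1.0, -0.63, -0.095, 0.0, 0.1, 1.0] -- smooth through zero, floor at -alpha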
7. Softplus
def softplus(z):
    z = _clip(z)
    return np.log1p(np.exp(z))

def softplus_deriv(z):
    return sigmoid(z)
What it does
A smooth approximation of ReLU:
\[ \text{softplus}(z) = \log(1 + e^z) \]
When to use it
- When you want ReLU-like behavior but with smooth derivatives
- Useful in probabilistic models (e.g., variance parameters)
Why it matters
- Always differentiable
- No dead neurons
- But slower than ReLU
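Two quick checks (a sketch using the functions above): softplus tracks ReLU closely away from zero, and a finite-difference estimate of its slope matches sigmoid, which is exactly what softplus_deriv returns.

import numpy as np

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(relu(z))      # [0.    0.    0.    1.    6.   ]
print(softplus(z))  # approx [0.0025, 0.313, 0.693, 1.313, 6.0025]

# Finite-difference check that d/dz softplus(z) == sigmoid(z)
eps = 1e-6
numeric = (softplus(z + eps) - softplus(z - eps)) / (2 * eps)
print(np.allclose(numeric, softplus_deriv(z), atol=1e-5))  # True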
8. Softmax (for Multi-Class Classification)
def softmax(z):
    # Operates on a single vector of logits.
    shift = z - np.max(z)          # subtract the max for numerical stability
    exp_vals = np.exp(shift)
    return exp_vals / np.sum(exp_vals)

def softmax_deriv(z):
    s = softmax(z)
    # Full Jacobian: J[i, j] = s[i] * (delta_ij - s[j])
    return np.diag(s) - np.outer(s, s)
What it does
Turns a vector of logits into a probability distribution.
Example:
z = [2.0, 1.0, 0.1]
softmax(z) → [0.66, 0.24, 0.10]
When to use it
- Output layer of multi-class classification
- Paired with cross-entropy loss
Why it matters
- Outputs valid probabilities
- Highlights the largest logit
- The derivative is a Jacobian, but in practice the loss simplifies it
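To make that last point concrete, here is a sketch (with a made-up one-hot target y and a hypothetical cross_entropy helper) showing that the gradient of cross-entropy with respect to the logits reduces to softmax(z) - y; a finite-difference estimate agrees.

import numpy as np

z = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([1.0, 0.0, 0.0])   # hypothetical one-hot target

def cross_entropy(logits, target):
    return -np.sum(target * np.log(softmax(logits)))

# Analytic gradient of the loss w.r.t. the logits: softmax(z) - y
analytic = softmax(z) - y

# Central finite-difference gradient for comparison
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y) -
     cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True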