Project 4: Solving the XOR Problem


Why Linear Models Fail, and Why Neural Networks Win 

The XOR problem is the smallest, cleanest demonstration of why neural networks must be nonlinear. It exposes the fundamental limitation of linear models and shows how hidden layers create new features that make the impossible solvable. 

This project introduces: 

  • nonlinear feature construction
  • hidden neurons as feature detectors 
  • backpropagation through multiple layers 
  • cross‑entropy loss for classification 

By the end, you’ll see a neural network invent the features needed to solve XOR.


The inputs are all possible pairs of binary values:

(0, 0)
(0, 1) 
(1, 0) 
(1, 1) 

The outputs follow the XOR truth table:

  • Output is 1 only when the inputs are different 
  • Output is 0 when the inputs are the same 


XOR Dataset

import numpy as np

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
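
Before building a network, it's worth checking the claim that no straight line can reproduce this table. A minimal brute-force sketch (not part of the project code; it only reuses the X and y arrays above) searches a coarse grid of candidate lines w1*x1 + w2*x2 + b and finds none whose sign matches XOR on all four points:

```
# Sketch: brute-force search for a single linear separator on XOR (illustrative only)
import itertools

grid = np.linspace(-5, 5, 41)
found = False
for w1, w2, b in itertools.product(grid, grid, grid):
    preds = (w1 * X[:, 0] + w2 * X[:, 1] + b > 0).astype(int)
    if np.array_equal(preds, y):
        found = True
        break

print("Linear separator found:", found)  # prints False: XOR is not linearly separable
```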


Initializing the Network Parameters

np.random.seed(42)

hidden_weights = np.random.randn(2, 2)
hidden_biases = np.random.randn(2)
output_weights = np.random.randn(2)
output_bias = np.random.randn()


What these shapes mean

  • hidden_weights: shape (2, 2)
    Maps 2 inputs → 2 hidden neurons

  • hidden_biases: shape (2,)
    One bias per hidden neuron

  • output_weights: shape (2,)
    Takes the 2 hidden activations → 1 output neuron

  • output_bias: scalar
    Bias for the output neuron

Random initialization ensures each neuron starts differently so they can learn different features.
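
A quick way to see why this matters: if the two hidden neurons start out identical, they produce identical activations, receive identical gradients, and remain clones after every update. Here is a small sketch using the same forward and backward equations as the training loop below, with deliberately identical starting values (illustrative only):

```
# Sketch: identical initialization keeps the two hidden neurons identical forever
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
sigmoid_deriv = lambda y: y * (1 - y)

hidden_weights = np.full((2, 2), 0.5)   # both rows (neurons) identical
hidden_biases = np.zeros(2)
output_weights = np.array([0.5, 0.5])
output_bias = 0.0

x_sample, target_val = np.array([0, 1]), 1

hidden_outputs = sigmoid(hidden_weights @ x_sample + hidden_biases)
y_hat = sigmoid(output_weights @ hidden_outputs + output_bias)

dL_dz_output = y_hat - target_val
dL_dz_hidden = output_weights * dL_dz_output * sigmoid_deriv(hidden_outputs)
dL_dw_hidden = np.outer(dL_dz_hidden, x_sample)

print(dL_dw_hidden)  # both rows are equal, so the two neurons can never specialize
```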


Activation Function

sigmoid = lambda z: 1 / (1 + np.exp(-z))

sigmoid_deriv = lambda y: y * (1 - y)


Sigmoid squashes values into the range (0, 1).
Its derivative is simple and works well for small networks. Note that sigmoid_deriv is written in terms of the sigmoid output: it expects y = sigmoid(z), not z itself.
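
A quick finite-difference check (a sketch, not part of the project code) confirms the formula:

```
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
sigmoid_deriv = lambda y: y * (1 - y)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_deriv(sigmoid(z))

print(numeric, analytic)  # both ≈ 0.2217
```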


Hyperparameters and Traces

from collections import defaultdict

lr = 0.1
epochs = 10000

loss_history = []
weight_trace = []
prediction_trace = defaultdict(list)
hidden_activation_trace = defaultdict(list)

interval = 100


  • lr: learning rate

  • epochs: number of full passes through the dataset

  • loss_history: track training progress

  • weight_trace: track how output weights change

  • prediction_trace: store predictions over time

  • hidden_activation_trace: store hidden neuron activations

  • interval: record snapshots every 100 epochs

These traces help visualize how the network learns XOR. The complete training loop comes next; the sections after it break it down piece by piece.


for epoch in range(epochs):
    total_loss = 0

    for i in range(4):
        x_sample = X[i]
        target_val = y[i]

        # Forward pass: Hidden layer
        z_hidden = hidden_weights @ x_sample + hidden_biases
        hidden_outputs = sigmoid(z_hidden)

        # Forward pass: Output layer
        z_output = output_weights @ hidden_outputs + output_bias
        y_hat = sigmoid(z_output)

        # Cross-Entropy Loss
        eps = 1e-10
        loss = -(target_val * np.log(y_hat + eps) +
                 (1 - target_val) * np.log(1 - y_hat + eps))
        total_loss += loss

        # Cross-Entropy Output Gradient (simplified)
        dL_dz_output = y_hat - target_val

        # Output layer gradients
        dL_dw_output = dL_dz_output * hidden_outputs
        dL_db_output = dL_dz_output

        # Backprop: Hidden layer
        d_hidden = sigmoid_deriv(hidden_outputs)
        dL_dz_hidden = output_weights * dL_dz_output * d_hidden
        dL_dw_hidden = np.outer(dL_dz_hidden, x_sample)
        dL_db_hidden = dL_dz_hidden

        # Gradient descent update
        output_weights -= lr * dL_dw_output
        output_bias -= lr * dL_db_output
        hidden_weights -= lr * dL_dw_hidden
        hidden_biases -= lr * dL_db_hidden

        weight_trace.append(output_weights.copy())

    # End of epoch
    loss_history.append(total_loss / 4) # average loss per sample

    if epoch % interval == 0:
        for i in range(4):
            x_sample = X[i]
            z_hidden = hidden_weights @ x_sample + hidden_biases
            hidden_outputs = sigmoid(z_hidden)
            z_output = output_weights @ hidden_outputs + output_bias
            y_pred = sigmoid(z_output)
            prediction_trace[i].append(y_pred)
            hidden_activation_trace[i].append(hidden_outputs.copy())

Training Loop and Forward Pass

What’s happening here?

  1. Multiply inputs by hidden weights
  2. Add hidden biases
  3. Apply sigmoid
  4. Multiply hidden activations by output weights
  5. Add output bias
  6. Apply sigmoid again

This produces the network’s prediction y_hat.

This is the same structure as logistic regression, just stacked twice. A worked numeric example follows the code below.


for epoch in range(epochs):
    total_loss = 0
    for i in range(4):
        x_sample = X[i]
        target_val = y[i]

        # Forward pass: Hidden layer
        z_hidden = hidden_weights @ x_sample + hidden_biases
        hidden_outputs = sigmoid(z_hidden)

        # Forward pass: Output layer
        z_output = output_weights @ hidden_outputs + output_bias
        y_hat = sigmoid(z_output)
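
To make these steps concrete, here is a worked forward pass with hand-picked weights (hypothetical values chosen for illustration, not the values the network learns): hidden neuron 1 acts roughly like OR, hidden neuron 2 roughly like NAND, and the output layer combines them.

```
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Hand-picked, illustrative weights (not learned values)
hidden_weights = np.array([[4.0, 4.0],      # ~ OR
                           [-4.0, -4.0]])   # ~ NAND
hidden_biases = np.array([-2.0, 6.0])
output_weights = np.array([5.0, 5.0])
output_bias = -7.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for x_sample in X:
    hidden_outputs = sigmoid(hidden_weights @ x_sample + hidden_biases)
    y_hat = sigmoid(output_weights @ hidden_outputs + output_bias)
    print(x_sample, round(float(y_hat), 2))
# [0 0] -> ~0.13, [0 1] -> ~0.79, [1 0] -> ~0.79, [1 1] -> ~0.13
```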



Loss Functions: MSE vs Cross‑Entropy

This is where the project becomes especially educational.


XOR is a classification problem.
The output neuron is a logistic unit.
The correct loss is binary cross‑entropy.

But many beginners start with MSE, as I did at first, so let's compare the two directly.

Mean Squared Error (MSE)
MSE works, but it fights the sigmoid’s curvature and slows learning.

```
# MSE loss
loss = 0.5 * (y_hat - target_val) ** 2

# MSE output gradient
dL_dyhat = y_hat - target_val
dyhat_dz = sigmoid_deriv(y_hat)
dL_dz_output = dL_dyhat * dyhat_dz
```
Problems with MSE for classification (see the numeric check below):

  • gradients vanish when the sigmoid saturates
  • learning is slower
  • the optimization landscape is less smooth
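
A quick numeric check of the first point (a sketch): when the sigmoid is saturated and confidently wrong, the MSE gradient is roughly a hundred times smaller than the cross-entropy gradient, so the weights barely move.

```
sigmoid_deriv = lambda y: y * (1 - y)

y_hat, target_val = 0.99, 0.0   # confidently wrong prediction

mse_grad = (y_hat - target_val) * sigmoid_deriv(y_hat)   # ≈ 0.0098
ce_grad = y_hat - target_val                              # 0.99

print(mse_grad, ce_grad)
```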

Binary Cross‑Entropy (Recommended)

Cross‑entropy is the natural loss for a probabilistic classifier: it is the negative log-likelihood of the Bernoulli distribution defined by the sigmoid output.

# Cross-entropy loss
eps = 1e-10
loss = -(target_val * np.log(y_hat + eps) +
         (1 - target_val) * np.log(1 - y_hat + eps))

# Cross-entropy output gradient (simplified)
dL_dz_output = y_hat - target_val



Why cross‑entropy is better:

  • gradient simplifies beautifully (see the short derivation below)
  • no sigmoid derivative needed
  • faster, more stable learning
  • matches logistic regression theory
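
Why the gradient simplifies (a quick chain-rule derivation, writing y for the target and y_hat for the prediction): with L = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)] and y_hat = sigmoid(z),

dL/dy_hat = -y / y_hat + (1 - y) / (1 - y_hat)
dy_hat/dz = y_hat * (1 - y_hat)
dL/dz     = dL/dy_hat * dy_hat/dz
          = -y * (1 - y_hat) + (1 - y) * y_hat
          = y_hat - y

The sigmoid derivative cancels against the denominators, which is exactly why no sigmoid_deriv call appears in the cross-entropy gradient.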

Side‑by‑Side Comparison


Step        MSE                                   Cross-Entropy
Loss        0.5 * (y_hat - y)^2                   -y*log(y_hat) - (1-y)*log(1-y_hat)
dL/dŷ       y_hat - y                             (not needed)
dŷ/dz       sigmoid_deriv(y_hat)                  (not needed)
dL/dz       (y_hat - y) * sigmoid_deriv(y_hat)    y_hat - y
Best for    regression                            classification

Backpropagation: Output Layer


This computes:

  • How much the output changed the loss
  • How much the output weights contributed
  • How much the output bias contributed

Output Layer: MSE

dL_dyhat = y_hat - target_val
dyhat_dz = sigmoid_deriv(y_hat)
dL_dz_output = dL_dyhat * dyhat_dz


Output Layer: Cross-Entropy

dL_dz_output = y_hat - target_val

That is the only change. The weight and bias gradients are then identical in either case:

dL_dw_output = dL_dz_output * hidden_outputs
dL_db_output = dL_dz_output


Backpropagation: Hidden Layer

d_hidden = sigmoid_deriv(hidden_outputs)
dL_dz_hidden = output_weights * dL_dz_output * d_hidden
dL_dw_hidden = np.outer(dL_dz_hidden, x_sample)
dL_db_hidden = dL_dz_hidden


This applies the chain rule:

  1. The output error flows backward through the output weights

  2. Each hidden neuron receives part of that error

  3. Multiply by the derivative of the hidden activation

  4. Compute gradients for hidden weights and biases

This is the core of backpropagation.
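
Backprop is easy to get subtly wrong, so it's worth verifying the hidden-layer gradients against finite differences at least once. Below is a self-contained sketch (illustrative, with its own random weights and a single sample), not part of the project code:

```
# Sketch: check the analytic hidden-layer gradient against finite differences
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
sigmoid_deriv = lambda y: y * (1 - y)

rng = np.random.default_rng(0)
hidden_weights = rng.standard_normal((2, 2))
hidden_biases = rng.standard_normal(2)
output_weights = rng.standard_normal(2)
output_bias = rng.standard_normal()

x_sample, target_val = np.array([1.0, 0.0]), 1.0

def loss_fn(hw):
    h = sigmoid(hw @ x_sample + hidden_biases)
    y_hat = sigmoid(output_weights @ h + output_bias)
    return -(target_val * np.log(y_hat) + (1 - target_val) * np.log(1 - y_hat))

# Analytic gradient (same chain rule as the training loop)
h = sigmoid(hidden_weights @ x_sample + hidden_biases)
y_hat = sigmoid(output_weights @ h + output_bias)
dL_dz_hidden = output_weights * (y_hat - target_val) * sigmoid_deriv(h)
analytic = np.outer(dL_dz_hidden, x_sample)

# Numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(hidden_weights)
for r in range(2):
    for c in range(2):
        w_plus, w_minus = hidden_weights.copy(), hidden_weights.copy()
        w_plus[r, c] += eps
        w_minus[r, c] -= eps
        numeric[r, c] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny (around 1e-8 or smaller)
```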


Gradient Descent Updates

output_weights -= lr * dL_dw_output
output_bias -= lr * dL_db_output
hidden_weights -= lr * dL_dw_hidden
hidden_biases -= lr * dL_db_hidden

weight_trace.append(output_weights.copy())


Each parameter moves a small step in the direction that reduces the loss.


Epoch Loss and Periodic Evaluation

loss_history.append(total_loss / 4)  # average loss per sample

if epoch % interval == 0:
    for i in range(4):
        x_sample = X[i]
        z_hidden = hidden_weights @ x_sample + hidden_biases
        hidden_outputs = sigmoid(z_hidden)
        z_output = output_weights @ hidden_outputs + output_bias
        y_pred = sigmoid(z_output)
        prediction_trace[i].append(y_pred)
        hidden_activation_trace[i].append(hidden_outputs.copy())


This records:

  • Loss over time

  • Predictions for each XOR input

  • Hidden neuron activations

These traces reveal how the network learns nonlinear features.
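
Before plotting, a simple sanity check (a sketch that reuses the trained parameters from the loop above) is to print the final prediction for each input:

```
# Final predictions after training (run this after the training loop)
for i in range(4):
    hidden_outputs = sigmoid(hidden_weights @ X[i] + hidden_biases)
    y_pred = sigmoid(output_weights @ hidden_outputs + output_bias)
    print(X[i], "target:", y[i], "prediction:", round(float(y_pred), 3))
# Expect values near 0 for (0,0) and (1,1), and near 1 for (0,1) and (1,0)
```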


Visualization

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Loss Curve
axs[0, 0].plot(loss_history)
axs[0, 0].set_title("XOR Training Loss Over Epochs")
axs[0, 0].set_xlabel("Epoch")
axs[0, 0].set_ylabel("Loss")
axs[0, 0].grid(True)

# Plot 2: Output Weight Updates
weight_trace_arr = np.array(weight_trace)
axs[0, 1].plot(weight_trace_arr[:, 0], label='Output Weight 1')
axs[0, 1].plot(weight_trace_arr[:, 1], label='Output Weight 2')
axs[0, 1].legend()
axs[0, 1].set_title("Output Weight Updates Over Training Steps")
axs[0, 1].set_xlabel("Update Step")
axs[0, 1].set_ylabel("Weight Value")
axs[0, 1].grid(True)

# Plot 3: Prediction Evolution
epochs_recorded = np.arange(0, epochs, interval)
for i in range(4):
    axs[1, 0].plot(epochs_recorded, prediction_trace[i],
                   label=f"Input {X[i]} → Target {y[i]}")
axs[1, 0].axhline(0.5, linestyle='--', color='gray', alpha=0.5)
axs[1, 0].set_title("Evolution of Network Predictions")
axs[1, 0].set_xlabel("Epoch")
axs[1, 0].set_ylabel("Predicted Output")
axs[1, 0].legend()
axs[1, 0].grid(True)

# Plot 4: Hidden Layer Activations
for i in range(4):
    activations = np.array(hidden_activation_trace[i])
    axs[1, 1].plot(epochs_recorded, activations[:, 0],
                   label=f"Input {X[i]} - Hidden Neuron 1")
    axs[1, 1].plot(epochs_recorded, activations[:, 1],
                   label=f"Input {X[i]} - Hidden Neuron 2", linestyle='--')
axs[1, 1].set_title("Evolution of Hidden Layer Activations")
axs[1, 1].set_xlabel("Epoch")
axs[1, 1].set_ylabel("Activation Value")
axs[1, 1].legend(fontsize='small')
axs[1, 1].grid(True)

plt.tight_layout()
plt.show()


What these plots show

  • Loss curve: Should steadily decrease

  • Output weights: Show how the network adjusts its final decision boundary

  • Predictions: Each XOR input gradually moves toward its correct output

  • Hidden activations:

    • One neuron typically learns an OR-like feature

    • The other learns a NAND-like (NOT AND) feature

    • Together they form a nonlinear representation that solves XOR, as the sketch after this list shows
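
One way to see this directly (a sketch that reuses the trained parameters from the training loop) is to threshold each hidden activation at 0.5 and read off the truth table it implements:

```
# Threshold the trained hidden activations to reveal the logic each neuron learned
for i in range(4):
    hidden_outputs = sigmoid(hidden_weights @ X[i] + hidden_biases)
    print(X[i], (hidden_outputs > 0.5).astype(int))
# A typical run shows one column matching OR and the other matching NAND
# (or AND, paired with a negative output weight); the exact roles depend
# on the random initialization.
```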
