Linear Regression: One Idea, Three Perspectives
Linear regression looks simple on the surface — draw a line through data — but underneath that simplicity is a surprisingly elegant structure. What most people don’t realize is that linear regression is actually one idea expressed in three different mathematical languages:
- Optimization form — learn the line by minimizing a loss function
- Geometric form — compute the line as a projection
- Algebraic form — compute the line using means, sums, covariance, and variance
Each form tells the same story from a different angle. When you see all three together, the whole picture snaps into place.
This post walks through each form using the same dataset and the same goal:
find the best‑fit line that predicts y from x.
1. Optimization Form: Learning the Line by Minimizing Loss
The optimization view is the one used in machine learning.
You start with random parameters, measure how wrong the model is, and adjust the parameters to reduce that error.
The Data
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([180, 200, 230, 260, 280, 300, 325, 375, 425, 480, 488, 510, 560, 600])
y = np.array([122, 120, 170, 180, 240, 238, 246, 320, 361, 370, 376, 390, 410, 470])

plt.scatter(x, y, color='blue')
plt.title("Raw Data")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()
```
Centering & Normalizing x
Gradient descent behaves dramatically better when features are normalized.
- Centering shifts the cloud so its mean is at 0
- Normalization scales the spread to 1
```python
x_mean = x.mean()
x_std = x.std()
x_center = x - x_mean
x_norm = x_center / x_std
```
This makes the optimization landscape smooth and well‑behaved.
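A quick sanity check (a minimal sketch using the variables defined above) confirms the normalized feature has mean 0 and standard deviation 1:

```python
# x_norm should have mean ~0 and standard deviation ~1 after centering and scaling
print(x_norm.mean(), x_norm.std())  # expect values very close to 0.0 and 1.0
```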
Gradient Descent
The idea:
- Start with a random slope `m` and intercept `c`
- Compute predictions
- Compute the loss (MSE)
- Compute gradients — how much the loss changes if you nudge `m` or `c`
- Update parameters:
```python
m = np.random.randn()
c = np.random.randn()
lr = 0.01
loss_history = []

for epoch in range(250):
    y_pred = m * x_norm + c
    error = y_pred - y
    # Gradients of the MSE with respect to m and c
    dm = (2 * (error * x_norm)).mean()  # equivalent to (2/n) * np.sum(error * x_norm)
    dc = (2 * error).mean()             # equivalent to (2/n) * np.sum(error)
    m -= lr * dm
    c -= lr * dc
    loss_history.append((error**2).mean())
```
After many iterations, the parameters converge.
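To see the convergence, you can plot the recorded loss values (a minimal sketch using the `loss_history` list built above):

```python
# Plot the MSE recorded at each epoch; the curve should flatten out as training converges
plt.plot(loss_history)
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.grid(True)
plt.show()
```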
Convert Back to Real Units
Because we trained on normalized x:
```python
m_real = m / x_std
c_real = c - m_real * x_mean
```
This gives the true slope and intercept in the original coordinate system.
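As an illustration (a sketch, not part of the original code), you can overlay the learned line on the raw data using the recovered parameters:

```python
# Draw the learned line in the original (unnormalized) coordinate system
plt.scatter(x, y, color='blue')
plt.plot(x, m_real * x + c_real, color='red', label='gradient descent fit')
plt.legend()
plt.grid(True)
plt.show()
```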
2. Geometric Form: The Projection Formula (OLS)
This is the most elegant view.
Once you center the data, linear regression becomes:
Projecting the vector y onto the vector x.
(With an intercept column included, this is equivalently the projection of y onto the column space of X = [1, x].)
The projection formula gives the exact solution in one step:
```python
# Build the design matrix with an intercept column, then solve the normal equations
X = np.column_stack([np.ones_like(x), x])
y_reshape = y.reshape(-1, 1)
beta = np.linalg.inv(X.T @ X) @ (X.T @ y_reshape)
b, m = beta.flatten()
```
This is the classic Ordinary Least Squares solution.
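As a side note (a sketch, not from the original post): explicitly inverting X.T @ X works fine here but is numerically fragile in general; `np.linalg.lstsq` solves the same least-squares problem more stably:

```python
# Same least-squares solution via a more numerically stable routine
beta_lstsq, *_ = np.linalg.lstsq(X, y_reshape, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True: both give the same coefficients
```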
- `y_hat = X @ beta` is the projection of y
- `residual = y - y_hat` is the orthogonal leftover
Geometrically:
- `y_hat` lies in the column space of X (the span of the intercept column and x)
- `residual` is perpendicular to that space
- The squared length of the residual, divided by n, is the MSE
This is linear algebra at its cleanest.
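A short numerical check of these claims (a sketch using the arrays defined above):

```python
# The residual should be (numerically) orthogonal to every column of X,
# and the MSE should equal the squared length of the residual divided by n
y_hat = X @ beta
residual = y_reshape - y_hat
print(X.T @ residual)                # ~[[0], [0]]: orthogonal to the intercept column and x
print((residual**2).sum() / len(x))  # MSE of the OLS fit
```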
3. Algebraic Form: Means, Sums, Covariance, Variance
This is the form taught in statistics classes — no matrices, no optimization, just arithmetic.
Step 1: Means
```python
x_mean = np.mean(x)
y_mean = np.mean(y)
```
Step 2: Covariance and Variance
```python
# These are the unnormalized sums; the 1/n factors cancel in the slope ratio
cov_xy = np.sum((x - x_mean) * (y - y_mean))
var_x = np.sum((x - x_mean)**2)
```
Step 3: Slope
```python
m = cov_xy / var_x
```
Step 4: Intercept
```python
b = y_mean - m * x_mean
```
Step 5: Predictions
```python
y_hat = m * x + b
```
This is the same line found by gradient descent and by projection.
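A quick way to confirm this (a sketch; it assumes the variables from all three sections are still in scope, and note that `m` and `b` now hold the algebraic result while `beta` holds the projection result):

```python
# Compare the slope/intercept found by the three approaches
b_geo, m_geo = beta.flatten()
print("gradient descent:", m_real, c_real)  # approximate: improves with more epochs
print("projection (OLS):", m_geo, b_geo)
print("algebraic:       ", m, b)
print(np.allclose([m_geo, b_geo], [m, b]))  # True: the two closed forms agree exactly
```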
How the Three Forms Connect
Even though the math looks different, all three approaches converge to the same line.
| Form | What it emphasizes | How it finds the line |
|---|---|---|
| Optimization | Learning, loss minimization | Iteratively reduces MSE |
| Geometric | Vector projection | One-step projection of y onto x |
| Algebraic | Means, sums, covariance | Closed-form slope/intercept |
They are three languages describing the same geometry.
Key Insights
- Centering moves the cloud to the origin
- Normalization stabilizes gradient descent
- Gradient descent learns the line by minimizing MSE
- Projection computes the line in one step
- Correlation is the cosine similarity between centered x and y
- R² is that cosine squared (see the quick check after this list)
- MSE is the squared length of the residual vector divided by n
- All three methods produce the same best‑fit line
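A quick numerical check of the correlation and R² claims (a sketch; `np.corrcoef` is used only as an independent reference):

```python
# The cosine similarity between the centered vectors equals the Pearson correlation,
# and its square equals R² for simple linear regression
xc = x - x.mean()
yc = y - y.mean()
cos = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(cos, np.corrcoef(x, y)[0, 1])  # these two numbers match
print(cos**2)                        # R² of the best-fit line
```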