Linear Regression: One Idea, Three Perspectives

Code: Kaggle Notebook · GitHub Repo

Linear regression looks simple on the surface — draw a line through data — but underneath that simplicity is a surprisingly elegant structure. What most people don’t realize is that linear regression is actually one idea expressed in three different mathematical languages:

  1. Optimization form — learn the line by minimizing a loss function
  2. Geometric form — compute the line as a projection
  3. Algebraic form — compute the line using means, sums, covariance, and variance


Each form tells the same story from a different angle. When you see all three together, the whole picture snaps into place.

This post walks through each form using the same dataset and the same goal:
find the best‑fit line that predicts y from x.


1. Optimization Form: Learning the Line by Minimizing Loss

The optimization view is the one used in machine learning.



You start with random parameters, measure how wrong the model is, and adjust the parameters to reduce that error.

The Data

import numpy as np
import matplotlib.pyplot as plt

x = np.array([180, 200, 230, 260, 280, 300, 325, 375, 425, 480, 488, 510, 560, 600])
y = np.array([122, 120, 170, 180, 240, 238, 246, 320, 361, 370, 376, 390, 410, 470])

plt.scatter(x, y, color='blue')
plt.title("Raw Data")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()

Centering & Normalizing x

Gradient descent behaves dramatically better when features are normalized.

  • Centering shifts the cloud, so its mean is at 0
  • Normalization scales the spread to 1
```python
x_mean = x.mean()
x_std = x.std()

x_center = x - x_mean
x_norm = x_center / x_std
```

This makes the optimization landscape smooth and well‑behaved.
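A quick way to confirm the transform did what the bullets above describe is to print the mean and spread of x_norm (a minimal check, reusing the variables just defined):

```python
print(x_norm.mean())   # approximately 0 after centering
print(x_norm.std())    # approximately 1 after scaling by the standard deviation
```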


Gradient Descent

The idea:

  1. Start with random slope m and intercept c
  2. Compute predictions
  3. Compute the loss (MSE)
  4. Compute gradients — how much the loss changes if you nudge m or c
  5. Update parameters:


m = np.random.randn()   # random starting slope
c = np.random.randn()   # random starting intercept
lr = 0.01               # learning rate

loss_history = []

for epoch in range(250):
    y_pred = m * x_norm + c
    error = y_pred - y

    # Gradients of the MSE with respect to m and c
    dm = (2 * (error * x_norm)).mean()   # same as (2/n) * np.sum(error * x_norm)
    dc = (2 * error).mean()              # same as (2/n) * np.sum(error)

    m -= lr * dm
    c -= lr * dc

    loss_history.append((error**2).mean())

After many iterations, the parameters converge.
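To see the convergence rather than take it on faith, you can plot the loss_history collected inside the loop (a small sketch using the matplotlib import from earlier):

```python
plt.plot(loss_history)
plt.title("MSE over Gradient Descent Epochs")
plt.xlabel("epoch")
plt.ylabel("MSE")
plt.grid(True)
plt.show()
```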


Convert Back to Real Units

Because we trained on normalized x, the line we learned is y ≈ m * (x - x_mean) / x_std + c. Expanding it gives the slope and intercept in the original units:

m_real = m / x_std
c_real = c - m_real * x_mean

This gives the true slope and intercept in the original coordinate system.
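As a sanity check (not part of the original walkthrough), the recovered slope and intercept can be compared against NumPy's closed-form least-squares fit and drawn over the raw data:

```python
# Closed-form reference fit, for comparison only
m_ref, c_ref = np.polyfit(x, y, 1)
print(m_real, c_real)    # gradient descent result (approximate)
print(m_ref, c_ref)      # should be very close

plt.scatter(x, y, color='blue')
plt.plot(x, m_real * x + c_real, color='red', label='gradient descent fit')
plt.legend()
plt.grid(True)
plt.show()
```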


2. Geometric Form: The Projection Formula (OLS)

This is the most elegant view.

Once you center the data, fitting the slope becomes:

Projecting the vector y onto the vector x.

With a column of ones added for the intercept, the same idea carries over: project y onto the column space of the design matrix X. The normal equations give that projection in one step:

X = np.column_stack([np.ones_like(x), x])   # design matrix: a column of ones and a column of x
y_reshape = y.reshape(-1, 1)

# Normal equations: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ (X.T @ y_reshape)
b, m = beta.flatten()                       # intercept, slope

This is the classic Ordinary Least Squares solution.

  • y_hat = X @ beta is the projection of y
  • residual = y - y_hat is the orthogonal leftover

Geometrically:

  • y_hat lies in the column space of X (for centered data, the span of x)
  • residual is perpendicular to every column of X
  • The squared length of the residual, divided by n, is the MSE

This is linear algebra at its cleanest.
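A minimal numeric check of those three claims, reusing X, beta, and y_reshape from the code above:

```python
y_hat = X @ beta                    # projection of y onto the column space of X
residual = y_reshape - y_hat        # the orthogonal leftover

print(X.T @ residual)               # ~[0, 0]: residual is perpendicular to each column of X
print((residual**2).mean())         # squared residual length divided by n, i.e. the MSE
```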


3. Algebraic Form: Means, Sums, Covariance, Variance

This is the form taught in statistics classes — no matrices, no optimization, just arithmetic.



Step 1: Means

x_mean = np.mean(x)
y_mean = np.mean(y)

Step 2: Covariance and Variance

# Sums of cross-products and squares; the 1/n factors would cancel in the ratio, so they are omitted
cov_xy = np.sum((x - x_mean) * (y - y_mean))
var_x  = np.sum((x - x_mean)**2)

Step 3: Slope

m = cov_xy / var_x

Step 4: Intercept

b = y_mean - m * x_mean

Step 5: Predictions

y_hat = m * x + b

This is the same line found by gradient descent and by projection.


How the Three Forms Connect

Even though the math looks different, all three approaches converge to the same line.

| Form         | What it emphasizes          | How it finds the line           |
|--------------|-----------------------------|---------------------------------|
| Optimization | Learning, loss minimization | Iteratively reduces MSE         |
| Geometric    | Vector projection           | One-step projection of y onto x |
| Algebraic    | Means, sums, covariance     | Closed-form slope/intercept     |

They are three languages describing the same geometry.
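Because each section above reuses the names m, c, and b, comparing the three results in one notebook requires storing them separately first. The sketch below assumes hypothetical names m_proj/b_proj (projection form) and m_alg/b_alg (algebraic form) alongside m_real/c_real from gradient descent:

```python
# Hypothetical names: assumes each section's slope/intercept was saved separately
print("gradient descent:", m_real, c_real)   # approximate; tightens with more epochs
print("projection (OLS):", m_proj, b_proj)
print("algebraic form:  ", m_alg, b_alg)

# The two closed-form answers agree up to floating-point error
assert np.allclose([m_proj, b_proj], [m_alg, b_alg])
```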


Key Insights

  • Centering moves the cloud to the origin
  • Normalization stabilizes gradient descent
  • Gradient descent learns the line by minimizing MSE
  • Projection computes the line in one step
  • Correlation is cosine similarity between centered x and y (checked in the sketch below)
  • R² is the squared cosine, i.e. the squared correlation
  • MSE is the squared length of the residual vector, divided by n
  • All three methods produce the same best‑fit line
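
A short check of the correlation-as-cosine claim, reusing x and y from the dataset above:

```python
xc = x - x.mean()                     # centered x
yc = y - y.mean()                     # centered y

# Cosine of the angle between the centered vectors
cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Pearson correlation for comparison
r = np.corrcoef(x, y)[0, 1]

print(cos_theta, r)   # the same number
print(r**2)           # R² of the simple regression
```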

