Linear Regression: One Idea, Three Perspectives
Linear regression looks simple on the surface — draw a line through data — but underneath that simplicity is a surprisingly elegant structure. What most people don’t realize is that linear regression is actually one idea expressed in three different mathematical languages:
- Optimization form — learn the line by minimizing a loss function
- Geometric form — compute the line as a projection
- Algebraic form — compute the line using means, sums, covariance, and variance
Each form tells the same story from a different angle. When you see all three together, the whole picture snaps into place.
This post walks through each form using the same dataset and the same goal:
find the best‑fit line that predicts y from x.
1. Optimization Form: Learning the Line by Minimizing Loss
The optimization view is the one used in machine learning.
You start with random parameters, measure how wrong the model is, and adjust the parameters to reduce that error.
The Data
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([180, 200, 230, 260, 280, 300, 325, 375, 425, 480, 488, 510, 560, 600])
y = np.array([122, 120, 170, 180, 240, 238, 246, 320, 361, 370, 376, 390, 410, 470])

plt.scatter(x, y, color='blue')
plt.title("Raw Data")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()
```
Centering & Normalizing x
Gradient descent behaves dramatically better when features are normalized.
- Centering shifts the cloud so its mean is at 0
- Normalization scales the spread to 1
```python
x_mean = x.mean()
x_std = x.std()
x_center = x - x_mean
x_norm = x_center / x_std
```
This makes the optimization landscape smooth and well‑behaved.
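A quick sanity check (a minimal sketch using the variables defined above) confirms the normalized feature has mean 0 and standard deviation 1:

```python
# x_norm should have mean ~0 and standard deviation ~1 after centering and scaling
print(x_norm.mean(), x_norm.std())  # expect values very close to 0.0 and 1.0
```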
Gradient Descent
The idea:
- Start with a random slope `m` and intercept `c`
- Compute predictions
- Compute the loss (MSE)
- Compute gradients — how much the loss changes if you nudge `m` or `c`
- Update parameters:
```python
m = np.random.randn()
c = np.random.randn()
lr = 0.01
loss_history = []

for epoch in range(250):
    y_pred = m * x_norm + c
    error = y_pred - y
    # Gradients of the MSE with respect to m and c
    dm = (2 * (error * x_norm)).mean()  # equivalent to (2/n) * np.sum(error * x_norm)
    dc = (2 * error).mean()             # equivalent to (2/n) * np.sum(error)
    m -= lr * dm
    c -= lr * dc
    loss_history.append((error**2).mean())
```
After many iterations, the parameters converge.
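To see the convergence, you can plot the recorded loss values (a minimal sketch using the `loss_history` list built above):

```python
# Plot the MSE recorded at each epoch; the curve should flatten out as training converges
plt.plot(loss_history)
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.grid(True)
plt.show()
```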
Convert Back to Real Units
Because we trained on normalized x:
```python
m_real = m / x_std
c_real = c - m_real * x_mean
```
This gives the true slope and intercept in the original coordinate system.
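As an illustration (a sketch, not part of the original code), you can overlay the learned line on the raw data using the recovered parameters:

```python
# Draw the learned line in the original (unnormalized) coordinate system
plt.scatter(x, y, color='blue')
plt.plot(x, m_real * x + c_real, color='red', label='gradient descent fit')
plt.legend()
plt.grid(True)
plt.show()
```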
2. Geometric Form: The Projection Formula (OLS)
This is the most elegant view.
Once you center the data, linear regression becomes:
Projecting the vector y onto the vector x.
(With an intercept column included, this is equivalently the projection of y onto the column space of X = [1, x].)
The projection formula gives the exact solution in one step:
```python
# Build the design matrix with an intercept column, then solve the normal equations
X = np.column_stack([np.ones_like(x), x])
y_reshape = y.reshape(-1, 1)
beta = np.linalg.inv(X.T @ X) @ (X.T @ y_reshape)
b, m = beta.flatten()
```
This is the classic Ordinary Least Squares solution.
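As a side note (a sketch, not from the original post): explicitly inverting X.T @ X works fine here but is numerically fragile in general; `np.linalg.lstsq` solves the same least-squares problem more stably:

```python
# Same least-squares solution via a more numerically stable routine
beta_lstsq, *_ = np.linalg.lstsq(X, y_reshape, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True: both give the same coefficients
```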
- `y_hat = X @ beta` is the projection of y
- `residual = y - y_hat` is the orthogonal leftover
Geometrically:
- `y_hat` lies in the column space of X (the span of the intercept column and x)
- `residual` is perpendicular to that space
- The squared length of the residual, divided by n, is the MSE
This is linear algebra at its cleanest.
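A short numerical check of these claims (a sketch using the arrays defined above):

```python
# The residual should be (numerically) orthogonal to every column of X,
# and the MSE should equal the squared length of the residual divided by n
y_hat = X @ beta
residual = y_reshape - y_hat
print(X.T @ residual)                # ~[[0], [0]]: orthogonal to the intercept column and x
print((residual**2).sum() / len(x))  # MSE of the OLS fit
```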
3. Algebraic Form: Means, Sums, Covariance, Variance
This is the form taught in statistics classes — no matrices, no optimization, just arithmetic.
Step 1: Means
```python
x_mean = np.mean(x)
y_mean = np.mean(y)
```
Step 2: Covariance and Variance
```python
# These are the unnormalized sums; the 1/n factors cancel in the slope ratio
cov_xy = np.sum((x - x_mean) * (y - y_mean))
var_x = np.sum((x - x_mean)**2)
```
Step 3: Slope
```python
m = cov_xy / var_x
```
Step 4: Intercept
```python
b = y_mean - m * x_mean
```
Step 5: Predictions
```python
y_hat = m * x + b
```
This is the same line found by gradient descent and by projection.
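A quick way to confirm this (a sketch; it assumes the variables from all three sections are still in scope, and note that `m` and `b` now hold the algebraic result while `beta` holds the projection result):

```python
# Compare the slope/intercept found by the three approaches
b_geo, m_geo = beta.flatten()
print("gradient descent:", m_real, c_real)  # approximate: improves with more epochs
print("projection (OLS):", m_geo, b_geo)
print("algebraic:       ", m, b)
print(np.allclose([m_geo, b_geo], [m, b]))  # True: the two closed forms agree exactly
```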
How the Three Forms Connect
Even though the math looks different, all three approaches converge to the same line.
| Form | What it emphasizes | How it finds the line |
|---|---|---|
| Optimization | Learning, loss minimization | Iteratively reduces MSE |
| Geometric | Vector projection | One-step projection of y onto x |
| Algebraic | Means, sums, covariance | Closed-form slope/intercept |
They are three languages describing the same geometry.
Key Insights
- Centering moves the cloud to the origin
- Normalization stabilizes gradient descent
- Gradient descent learns the line by minimizing MSE
- Projection computes the line in one step
- Correlation is the cosine similarity between centered x and y
- R² is that cosine squared (see the quick check after this list)
- MSE is the squared length of the residual vector divided by n
- All three methods produce the same best‑fit line
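A quick numerical check of the correlation and R² claims (a sketch; `np.corrcoef` is used only as an independent reference):

```python
# The cosine similarity between the centered vectors equals the Pearson correlation,
# and its square equals R² for simple linear regression
xc = x - x.mean()
yc = y - y.mean()
cos = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(cos, np.corrcoef(x, y)[0, 1])  # these two numbers match
print(cos**2)                        # R² of the best-fit line
```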