Core Concepts of Machine Learning: A First‑Principles Glossary
Machine learning can feel like a maze of formulas and jargon, but underneath it all, the field is built from a small set of core ideas. These ideas repeat across linear regression, logistic regression, neural networks, and even modern deep learning.
This glossary collects the essential concepts from the first four projects of the curriculum. Each entry focuses on intuition first, with just enough math to make the idea clear.
Think of this as the conceptual map that ties everything together.
The Design Matrix (X)
The design matrix is the standard way machine learning represents data.
- Each row is one example
- Each column is one feature
- Shape: number of samples by number of features
Why it matters:
- Turns many dot products into one matrix multiplication
- Enables vectorized gradient descent
- Makes batching easy (just slice rows)
In neural networks, the basic operation is:
X * W + b
The design matrix is the bridge between classical regression and deep learning.
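A minimal NumPy sketch of the idea (the numbers and shapes are made up for illustration):

```python
import numpy as np

# Hypothetical design matrix: 4 samples (rows), 3 features (columns)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.0, 0.0, 1.0]])   # shape (4, 3)

W = np.array([0.5, -1.0, 2.0])   # one weight per feature, shape (3,)
b = 0.1

# One matrix multiplication replaces four separate dot products
predictions = X @ W + b          # shape (4,)
print(predictions)
```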
Dot Product
The dot product combines features with weights:
w1*x1 + w2*x2 + ... + wd*xd
Intuition:
- Measures alignment between the weight vector and the input
- Compresses many features into one number
- Defines the geometry of linear models (lines, planes, hyperplanes)
Every neuron in a neural network begins with a dot product.
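A quick sketch with made-up numbers:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # hypothetical weights
x = np.array([1.0, 2.0, 3.0])    # one example's features

# w1*x1 + w2*x2 + w3*x3 collapsed into a single number
z = np.dot(w, x)                 # equivalently: w @ x
print(z)                         # 0.5 - 2.0 + 6.0 = 4.5
```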
Mean Squared Error (MSE)
The standard loss function for regression:
MSE = (1/n) * sum of (y - y_hat)^2
Intuition:
- Measures how far predictions are from targets
- Penalizes large errors heavily
- Equals the squared length of the residual vector, divided by n
MSE is the language of continuous prediction.
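In code, MSE is one line (the targets and predictions below are made up):

```python
import numpy as np

y     = np.array([3.0, -0.5, 2.0, 7.0])   # true targets
y_hat = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

residuals = y - y_hat
mse = np.mean(residuals ** 2)             # (1/n) * sum of squared residuals
print(mse)                                # 0.375
```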
Residuals
Residual = actual value minus predicted value
Geometric view:
- Residuals form a vector in n‑dimensional space
- Linear regression finds the line or plane that makes this vector as short as possible
- The residual vector is perpendicular to the column space of X (and therefore to the prediction vector)
This orthogonality is the heart of ordinary least squares.
Ordinary Least Squares (OLS)
The closed‑form solution to linear regression: beta = inverse(X^T * X) * X^T * y
Interpretation:
- Computes the exact projection of y onto the column space of X
- Gives the geometric solution to minimizing MSE
- Gradient descent is the iterative approximation of this solution
OLS is the cleanest expression of linear regression’s geometry.
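A small sketch on synthetic data, showing both the closed form and the residual orthogonality from the previous entry:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=50)

# Closed-form OLS: beta = inverse(X^T X) X^T y
# (np.linalg.lstsq is the numerically safer equivalent in practice)
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                    # close to [1, 2, -3]

# The residual vector is orthogonal to the column space of X
residuals = y - X @ beta
print(X.T @ residuals)         # all entries approximately 0
```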
Correlation (r) and R‑squared (R2)
Correlation (r):
- Measures linear alignment between two centered vectors
- Equivalent to the cosine of the angle between centered x and y
R‑squared (R2):
- For simple linear regression, equal to the square of the correlation r
- Measures how much of the variance in y is explained by the model
- Ranges from 0 to 1
In geometric terms:
- r is the cosine of the angle between centered x and y
- R2 is the squared cosine of that angle
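A small check of the cosine view, using made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly y = 2x

xc, yc = x - x.mean(), y - y.mean()          # center both vectors

# r = cosine of the angle between the centered vectors
r = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(r, r ** 2)                             # r close to 1; R^2 = r^2 for simple regression
```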
Cross‑Entropy Loss
The correct loss for binary classification:
J = -[ y * log(y_hat) + (1 - y) * log(1 - y_hat) ]
Intuition:
- Measures how “surprised” the model is by the true label
- Penalizes confident wrong predictions heavily
- Comes from maximum likelihood
Cross‑entropy is the language of probability models.
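A minimal implementation sketch (the labels and predictions are made up; the clipping constant is a common numerical safeguard):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip to avoid log(0) for over-confident predictions
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y, y_hat))   # the confident-but-wrong 0.3 dominates the loss
```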
Sigmoid Function
sigma(z) = 1 / (1 + e^(-z))
Purpose:
- Squashes any real number into the range 0 to 1
- Turns linear outputs into probabilities
- Makes logistic regression a classification model
In neural networks, sigmoid is one of many activation functions.
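A quick sketch of the squashing behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # approx [0.007, 0.27, 0.5, 0.73, 0.993] -- squashed into (0, 1)
```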
Chain Rule
The chain rule explains how to differentiate composite functions:
Derivative of f(g(x)) = f'(g(x)) * g'(x)
Intuition:
- Break a complex function into simple steps
- Multiply the gradients of each step
- This is the backbone of backpropagation
The chain rule is the “flow of influence” through a computation.
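A worked example with hypothetical f and g, checked numerically:

```python
# f(g(x)) with f(u) = u^2 and g(x) = 3x + 1
# Chain rule: derivative = f'(g(x)) * g'(x) = 2*(3x + 1) * 3
x = 2.0
analytic = 2 * (3 * x + 1) * 3

# Finite-difference check of the same derivative
h = 1e-6
numeric = ((3 * (x + h) + 1) ** 2 - (3 * x + 1) ** 2) / h
print(analytic, numeric)   # both approximately 42
```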
Backpropagation
Backpropagation is how gradients are computed for layered functions so gradient descent can update them.
Process:
- Forward pass: compute outputs
- Compute loss
- Backward pass: apply the chain rule layer by layer
- Update parameters
Key idea:
Backprop = chain rule + vectorized gradients
Every neural network training loop is built on this.
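A minimal sketch of that loop: a tiny two-layer network trained on XOR (the architecture, learning rate, and data are all illustrative choices, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer
lr = 1.0

for step in range(5000):
    # Forward pass: compute outputs
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule, layer by layer (cross-entropy + sigmoid output)
    d_out = (y_hat - y) / len(X)             # gradient at the output pre-activation
    dW2, db2 = h.T @ d_out, d_out.sum(0, keepdims=True)
    d_hidden = (d_out @ W2.T) * h * (1 - h)  # push the gradient through the hidden layer
    dW1, db1 = X.T @ d_hidden, d_hidden.sum(0, keepdims=True)

    # Update parameters
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(y_hat.round(3))   # approaches [0, 1, 1, 0] for most initializations
```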
Power Rule
A basic derivative rule:
Derivative of x^n = n * x^(n - 1)
Why it matters:
- Appears inside many gradient calculations
- Used in MSE, polynomial models, and activation derivatives
- Forms part of the “local gradient” in backprop
The power rule is one of the building blocks of symbolic differentiation.
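A tiny numerical check, using n = 3 and x = 2 as arbitrary values:

```python
# Power rule: d/dx x^n = n * x^(n - 1)
x, n, h = 2.0, 3, 1e-6
analytic = n * x ** (n - 1)               # 3 * 2^2 = 12
numeric = ((x + h) ** n - x ** n) / h     # finite-difference estimate
print(analytic, numeric)                  # both approximately 12
```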
Activation Functions
Nonlinear functions applied after the linear step.
Examples:
- Sigmoid
- ReLU
- Tanh
Purpose:
- Allow networks to learn nonlinear patterns
- Without them, stacking layers collapses into a single linear model
Nonlinearity gives neural networks their expressive power.
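A side-by-side sketch of the three examples above on the same inputs:

```python
import numpy as np

z = np.linspace(-3, 3, 7)

sigmoid = 1 / (1 + np.exp(-z))   # squashes into (0, 1)
relu    = np.maximum(0, z)       # zero for negatives, identity for positives
tanh    = np.tanh(z)             # squashes into (-1, 1)

print(sigmoid, relu, tanh, sep="\n")
```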
Decision Boundary
The set of points where the model is undecided.
For logistic regression:
w*x + b = 0
Geometry:
- In 2D → a line
- In 3D → a plane
- In higher dimensions → a hyperplane
The weight vector is perpendicular to the boundary.
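A small sketch with hypothetical 2D parameters: the sign of w*x + b tells you which side of the line a point falls on.

```python
import numpy as np

# Hypothetical logistic-regression parameters in 2D
w = np.array([2.0, -1.0])
b = -0.5

points = np.array([[1.0, 1.0], [0.0, 0.0], [0.25, 0.0]])
scores = points @ w + b   # sign indicates the side of the boundary w*x + b = 0
print(scores)             # positive, negative, and 0 (exactly on the boundary)
```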
Linear vs Nonlinear Models
Linear models:
- Combine features with a weighted sum
- Can only separate data with a line or plane
- Fail on XOR
Nonlinear models:
- Use activation functions to bend space
- Can separate complex patterns
- Form the basis of neural networks
XOR is the classic example showing the difference.
Vectorized Gradients
Instead of computing gradients one weight at a time:
dw = (1/n) * X^T * (y_hat - y)
Benefits:
- Faster
- Cleaner
- Matches neural network math
- Enables batching and GPU acceleration
Vectorization is the modern way to compute gradients.
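A sketch of that formula for logistic regression, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
y = rng.integers(0, 2, size=100)         # binary labels
w, b = np.zeros(3), 0.0

y_hat = 1 / (1 + np.exp(-(X @ w + b)))   # current predictions

# All three weight gradients in one line -- no per-weight loop
dw = X.T @ (y_hat - y) / len(X)
db = np.mean(y_hat - y)
print(dw, db)
```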
Batching
Instead of using all data at once:
- Take a slice of rows from X
- Compute predictions and gradients on that slice
- Update weights
- Repeat
Benefits:
- Faster
- More stable
- Works with large datasets
- Matches how GPUs operate
Batching is simply “mini design matrices.”
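A minimal mini-batch loop for logistic regression (the dataset, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))                      # full design matrix
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w, b, lr, batch_size = np.zeros(3), 0.0, 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))                 # shuffle rows each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                     # a "mini design matrix"
        y_hat = 1 / (1 + np.exp(-(Xb @ w + b)))
        w -= lr * Xb.T @ (y_hat - yb) / len(Xb)     # vectorized gradient on the batch
        b -= lr * np.mean(y_hat - yb)

print(w, b)
```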
Summary
These concepts form the foundation of:
- Linear regression
- Logistic regression
- Neural networks
- Backpropagation
- Deep learning architectures
Once you understand these ideas, the entire field becomes far less mysterious.
Every model — from a single neuron to a transformer — is built from these same ingredients.