Core Concepts of Machine Learning: A First‑Principles Glossary
Machine learning can feel like a maze of formulas and jargon, but underneath it all, the field is built from a small set of core ideas. These ideas repeat across linear regression, logistic regression, neural networks, and even modern deep learning.
This glossary collects the essential concepts from the first four projects of the curriculum. Each entry focuses on intuition first, with just enough math to make the idea clear.
Think of this as the conceptual map that ties everything together.
The Design Matrix (X)
The design matrix is the standard way machine learning represents data.
- Each row is one example
- Each column is one feature
- Shape: number of samples by number of features
Why it matters:
- Turns many dot products into one matrix multiplication
- Enables vectorized gradient descent
- Makes batching easy (just slice rows)
In neural networks, the basic operation is:
X * W + b
The design matrix is the bridge between classical regression and deep learning.
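A minimal NumPy sketch of the idea (the numbers and shapes are made up for illustration):

```python
import numpy as np

# Hypothetical design matrix: 4 samples (rows), 3 features (columns)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.0, 0.0, 1.0]])   # shape (4, 3)

W = np.array([0.5, -1.0, 2.0])   # one weight per feature, shape (3,)
b = 0.1

# One matrix multiplication replaces four separate dot products
predictions = X @ W + b          # shape (4,)
print(predictions)
```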
Dot Product
The dot product combines features with weights:
w1*x1 + w2*x2 + ... + wd*xd
Intuition:
- Measures alignment between the weight vector and the input
- Compresses many features into one number
- Defines the geometry of linear models (lines, planes, hyperplanes)
Every neuron in a neural network begins with a dot product.
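A quick sketch with made-up numbers:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # hypothetical weights
x = np.array([1.0, 2.0, 3.0])    # one example's features

# w1*x1 + w2*x2 + w3*x3 collapsed into a single number
z = np.dot(w, x)                 # equivalently: w @ x
print(z)                         # 0.5 - 2.0 + 6.0 = 4.5
```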
Mean Squared Error (MSE)
The standard loss function for regression:
MSE = (1/n) * sum of (y - y_hat)^2
Intuition:
- Measures how far predictions are from targets
- Penalizes large errors heavily
- Equals the squared length of the residual vector, divided by n
MSE is the language of continuous prediction.
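In code, MSE is one line (the targets and predictions below are made up):

```python
import numpy as np

y     = np.array([3.0, -0.5, 2.0, 7.0])   # true targets
y_hat = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

residuals = y - y_hat
mse = np.mean(residuals ** 2)             # (1/n) * sum of squared residuals
print(mse)                                # 0.375
```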
Residuals
Residual = actual value minus predicted value
Geometric view:
- Residuals form a vector in n‑dimensional space
- Linear regression finds the line or plane that makes this vector as short as possible
- The residual vector is perpendicular to the column space of X (and therefore to the prediction vector)
This orthogonality is the heart of ordinary least squares.
Ordinary Least Squares (OLS)
The closed‑form solution to linear regression: beta = inverse(X^T * X) * X^T * y
Interpretation:
- Computes the exact projection of y onto the column space of X
- Gives the geometric solution to minimizing MSE
- Gradient descent is the iterative approximation of this solution
OLS is the cleanest expression of linear regression’s geometry.
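A small sketch on synthetic data, showing both the closed form and the residual orthogonality from the previous entry:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=50)

# Closed-form OLS: beta = inverse(X^T X) X^T y
# (np.linalg.lstsq is the numerically safer equivalent in practice)
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                    # close to [1, 2, -3]

# The residual vector is orthogonal to the column space of X
residuals = y - X @ beta
print(X.T @ residuals)         # all entries approximately 0
```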
Correlation (r) and R‑squared (R2)
Correlation (r):
- Measures linear alignment between two centered vectors
- Equivalent to the cosine of the angle between centered x and y
R‑squared (R2):
- For simple linear regression, equal to the square of the correlation r
- Measures how much of the variance in y is explained by the model
- Ranges from 0 to 1
In geometric terms:
- r is the cosine of the angle between centered x and y
- R2 is the squared cosine of that angle
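A small check of the cosine view, using made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly y = 2x

xc, yc = x - x.mean(), y - y.mean()          # center both vectors

# r = cosine of the angle between the centered vectors
r = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(r, r ** 2)                             # r close to 1; R^2 = r^2 for simple regression
```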
Cross‑Entropy Loss
The correct loss for binary classification:
J = -[ y * log(y_hat) + (1 - y) * log(1 - y_hat) ]
Intuition:
- Measures how “surprised” the model is by the true label
- Penalizes confident wrong predictions heavily
- Comes from maximum likelihood
Cross‑entropy is the language of probability models.
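A minimal implementation sketch (the labels and predictions are made up; the clipping constant is a common numerical safeguard):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip to avoid log(0) for over-confident predictions
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y, y_hat))   # the confident-but-wrong 0.3 dominates the loss
```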
Sigmoid Function
sigma(z) = 1 / (1 + e^(-z))
Purpose:
- Squashes any real number into the range 0 to 1
- Turns linear outputs into probabilities
- Makes logistic regression a classification model
In neural networks, sigmoid is one of many activation functions.
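A quick sketch of the squashing behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # approx [0.007, 0.27, 0.5, 0.73, 0.993] -- squashed into (0, 1)
```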
Chain Rule
The chain rule explains how to differentiate composite functions:
Derivative of f(g(x)) = f'(g(x)) * g'(x)
Intuition:
- Break a complex function into simple steps
- Multiply the gradients of each step
- This is the backbone of backpropagation
The chain rule is the “flow of influence” through a computation.
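A worked example with hypothetical f and g, checked numerically:

```python
# f(g(x)) with f(u) = u^2 and g(x) = 3x + 1
# Chain rule: derivative = f'(g(x)) * g'(x) = 2*(3x + 1) * 3
x = 2.0
analytic = 2 * (3 * x + 1) * 3

# Finite-difference check of the same derivative
h = 1e-6
numeric = ((3 * (x + h) + 1) ** 2 - (3 * x + 1) ** 2) / h
print(analytic, numeric)   # both approximately 42
```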
Backpropagation
Backpropagation is how gradients are computed for layered functions so gradient descent can update them.
Process:
- Forward pass: compute outputs
- Compute loss
- Backward pass: apply the chain rule layer by layer
- Update parameters
Key idea:
Backprop = chain rule + vectorized gradients
Every neural network training loop is built on this.
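A minimal sketch of that loop: a tiny two-layer network trained on XOR (the architecture, learning rate, and data are all illustrative choices, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer
lr = 1.0

for step in range(5000):
    # Forward pass: compute outputs
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule, layer by layer (cross-entropy + sigmoid output)
    d_out = (y_hat - y) / len(X)             # gradient at the output pre-activation
    dW2, db2 = h.T @ d_out, d_out.sum(0, keepdims=True)
    d_hidden = (d_out @ W2.T) * h * (1 - h)  # push the gradient through the hidden layer
    dW1, db1 = X.T @ d_hidden, d_hidden.sum(0, keepdims=True)

    # Update parameters
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(y_hat.round(3))   # approaches [0, 1, 1, 0] for most initializations
```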
Power Rule
A basic derivative rule:
Derivative of x^n = n * x^(n - 1)
Why it matters:
- Appears inside many gradient calculations
- Used in MSE, polynomial models, and activation derivatives
- Forms part of the “local gradient” in backprop
The power rule is one of the building blocks of symbolic differentiation.
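A tiny numerical check, using n = 3 and x = 2 as arbitrary values:

```python
# Power rule: d/dx x^n = n * x^(n - 1)
x, n, h = 2.0, 3, 1e-6
analytic = n * x ** (n - 1)               # 3 * 2^2 = 12
numeric = ((x + h) ** n - x ** n) / h     # finite-difference estimate
print(analytic, numeric)                  # both approximately 12
```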
Activation Functions
Nonlinear functions applied after the linear step.
Examples:
- Sigmoid
- ReLU
- Tanh
Purpose:
- Allow networks to learn nonlinear patterns
- Without them, stacking layers collapses into a single linear model
Nonlinearity gives neural networks their expressive power.
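A side-by-side sketch of the three examples above on the same inputs:

```python
import numpy as np

z = np.linspace(-3, 3, 7)

sigmoid = 1 / (1 + np.exp(-z))   # squashes into (0, 1)
relu    = np.maximum(0, z)       # zero for negatives, identity for positives
tanh    = np.tanh(z)             # squashes into (-1, 1)

print(sigmoid, relu, tanh, sep="\n")
```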
Decision Boundary
The set of points where the model is undecided.
For logistic regression:
w*x + b = 0
Geometry:
- In 2D → a line
- In 3D → a plane
- In higher dimensions → a hyperplane
The weight vector is perpendicular to the boundary.
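A small sketch with hypothetical 2D parameters: the sign of w*x + b tells you which side of the line a point falls on.

```python
import numpy as np

# Hypothetical logistic-regression parameters in 2D
w = np.array([2.0, -1.0])
b = -0.5

points = np.array([[1.0, 1.0], [0.0, 0.0], [0.25, 0.0]])
scores = points @ w + b   # sign indicates the side of the boundary w*x + b = 0
print(scores)             # positive, negative, and 0 (exactly on the boundary)
```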
Linear vs Nonlinear Models
Linear models:
- Combine features with a weighted sum
- Can only separate data with a line or plane
- Fail on XOR
Nonlinear models:
- Use activation functions to bend space
- Can separate complex patterns
- Form the basis of neural networks
XOR is the classic example showing the difference.
Vectorized Gradients
Instead of computing gradients one weight at a time:
dw = (1/n) * X^T * (y_hat - y)
Benefits:
- Faster
- Cleaner
- Matches neural network math
- Enables batching and GPU acceleration
Vectorization is the modern way to compute gradients.
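A sketch of that formula for logistic regression, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
y = rng.integers(0, 2, size=100)         # binary labels
w, b = np.zeros(3), 0.0

y_hat = 1 / (1 + np.exp(-(X @ w + b)))   # current predictions

# All three weight gradients in one line -- no per-weight loop
dw = X.T @ (y_hat - y) / len(X)
db = np.mean(y_hat - y)
print(dw, db)
```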
Batching
Instead of using all data at once:
- Take a slice of rows from X
- Compute predictions and gradients on that slice
- Update weights
- Repeat
Benefits:
- Faster
- More stable
- Works with large datasets
- Matches how GPUs operate
Batching is simply “mini design matrices.”
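A minimal mini-batch loop for logistic regression (the dataset, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))                      # full design matrix
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w, b, lr, batch_size = np.zeros(3), 0.0, 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))                 # shuffle rows each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                     # a "mini design matrix"
        y_hat = 1 / (1 + np.exp(-(Xb @ w + b)))
        w -= lr * Xb.T @ (y_hat - yb) / len(Xb)     # vectorized gradient on the batch
        b -= lr * np.mean(y_hat - yb)

print(w, b)
```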
Summary
These concepts form the foundation of:
- Linear regression
- Logistic regression
- Neural networks
- Backpropagation
- Deep learning architectures
Once you understand these ideas, the entire field becomes far less mysterious.
Every model — from a single neuron to a transformer — is built from these same ingredients.