
Core Concepts of Machine Learning: A First‑Principles Glossary


Machine learning can feel like a maze of formulas and jargon, but underneath it all, the field is built from a small set of core ideas. These ideas repeat across linear regression, logistic regression, neural networks, and even modern deep learning.

This glossary collects the essential concepts from the first four projects of the curriculum. Each entry focuses on intuition first, with just enough math to make the idea clear.

Think of this as the conceptual map that ties everything together.

 

The Design Matrix (X)

The design matrix is the standard way machine learning represents data.

  • Each row is one example
  • Each column is one feature
  • Shape: number of samples by number of features

Why it matters:
  • Turns many dot products into one matrix multiplication
  • Enables vectorized gradient descent
  • Makes batching easy (just slice rows)

In neural networks, the basic operation is:

X * W + b

The design matrix is the bridge between classical regression and deep learning.
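
A minimal NumPy sketch of the idea (the array names and values are illustrative, not taken from the curriculum's code):

    import numpy as np

    # 3 samples (rows), 2 features (columns)
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])       # shape (3, 2)

    W = np.array([[0.5],
                  [-0.25]])          # shape (2, 1): one weight per feature
    b = 0.1

    # One matrix multiplication replaces three separate dot products
    predictions = X @ W + b          # shape (3, 1)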
 

Dot Product



The dot product combines features with weights:

w1*x1 + w2*x2 + ... + wd*xd

Intuition:
  • Measures alignment between the weight vector and the input
  • Compresses many features into one number
  • Defines the geometry of linear models (lines, planes, hyperplanes)

Every neuron in a neural network begins with a dot product.
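
In NumPy this is a single call (x and w below are made-up values):

    import numpy as np

    x = np.array([2.0, -1.0, 0.5])   # one example's features
    w = np.array([0.3, 0.8, -0.2])   # the model's weights

    # w1*x1 + w2*x2 + w3*x3 collapsed into a single number
    z = np.dot(w, x)                 # -0.3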

 

Mean Squared Error (MSE)

The loss function for regression:

MSE = (1/n) * sum of (y - y_hat)^2

Intuition:
  • Measures how far predictions are from targets
  • Penalizes large errors heavily
  • Equals the squared length of the residual vector

MSE is the language of continuous prediction.
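
A small sketch of the formula on toy numbers:

    import numpy as np

    def mse(y, y_hat):
        # mean of squared residuals; the square makes large errors dominate
        return np.mean((y - y_hat) ** 2)

    y     = np.array([3.0, 5.0, 7.0])
    y_hat = np.array([2.5, 5.5, 9.0])
    print(mse(y, y_hat))   # 1.5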

 

Residuals

Residual = actual value minus predicted value

Geometric view:
  • Residuals form a vector in n‑dimensional space
  • Linear regression finds the line or plane that makes this vector as short as possible
  • The residual vector is perpendicular to the prediction vector

This orthogonality is the heart of ordinary least squares.
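
A quick numerical check of that orthogonality claim, on synthetic data (variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 2))
    y = X @ np.array([1.5, -2.0]) + 0.1 * rng.standard_normal(50)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
    y_hat = X @ beta
    residual = y - y_hat

    # The residual vector is (numerically) perpendicular to the predictions
    print(np.dot(residual, y_hat))   # ~0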

 
Ordinary Least Squares (OLS)

The closed‑form solution to linear regression:

beta = inverse(X^T * X) * X^T * y


Interpretation:
  • Computes the exact projection of y onto the column space of X
  • Gives the geometric solution to minimizing MSE
  • Gradient descent is the iterative approximation of this solution

OLS is the cleanest expression of linear regression’s geometry.
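
The closed form in NumPy (a sketch on synthetic data; in practice you solve the normal equations rather than invert X^T X explicitly):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(100)

    # Normal equations: (X^T X) beta = X^T y
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)   # close to [2.0, -1.0, 0.5]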

 

Correlation (r) and R‑squared (R2)

Correlation (r):
  • Measures linear alignment between two centered vectors
  • Equivalent to the cosine of the angle between centered x and y

R‑squared (R2):
  • The square of correlation
  • Measures how much of the variance in y is explained by the model
  • Ranges from 0 to 1

Geometric intuition:
  • r is the cosine of the angle
  • R2 is the squared cosine
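
Both quantities are short computations in NumPy (toy data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

    r = np.corrcoef(x, y)[0, 1]   # cosine of the angle between centered x and y
    r_squared = r ** 2            # share of y's variance explained by the fit
    print(r, r_squared)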

 

Cross‑Entropy Loss

The correct loss for binary classification:

J = -[ y * log(y_hat) + (1 - y) * log(1 - y_hat) ]

Intuition:
  • Measures how “surprised” the model is by the true label
  • Penalizes confident wrong predictions heavily
  • Comes from maximum likelihood

Cross‑entropy is the language of probability models.
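
A sketch of the formula in code (the clipping is a common guard against log(0), not part of the math above):

    import numpy as np

    def binary_cross_entropy(y, y_hat, eps=1e-12):
        # clip so a fully confident wrong prediction doesn't produce log(0)
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    y     = np.array([1.0, 0.0, 1.0])
    y_hat = np.array([0.9, 0.2, 0.6])
    print(binary_cross_entropy(y, y_hat))   # modest loss; confident mistakes blow it up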

 

Sigmoid Function

sigma(z) = 1 / (1 + e^(-z))

Purpose:

  • Squashes any real number into the range 0 to 1
  • Turns linear outputs into probabilities
  • Makes logistic regression a classification model

In neural networks, sigmoid is one of many activation functions.
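
The function in code, with a few sample values:

    import numpy as np

    def sigmoid(z):
        # squashes any real number into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))    # 0.5: exactly undecided
    print(sigmoid(4.0))    # ~0.98
    print(sigmoid(-4.0))   # ~0.02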

 

Chain Rule

The chain rule explains how to differentiate composite functions:

Derivative of f(g(x)) = f’(g(x)) * g’(x)

Intuition:
  • Break a complex function into simple steps
  • Multiply the gradients of each step
  • This is the backbone of backpropagation

The chain rule is the “flow of influence” through a computation.
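
A numerical sanity check of the rule, using f(u) = u^2 and g(x) = sin(x) as an example:

    import numpy as np

    def composite(x):
        return np.sin(x) ** 2             # f(g(x)) with f(u) = u^2, g(x) = sin(x)

    def composite_grad(x):
        return 2 * np.sin(x) * np.cos(x)  # f'(g(x)) * g'(x)

    x = 0.7
    numeric = (composite(x + 1e-6) - composite(x - 1e-6)) / 2e-6
    print(composite_grad(x), numeric)     # the two values agree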

 

Backpropagation

Backpropagation is the chain rule applied to layered functions, so that gradient descent can update every layer's parameters.

Process:
  • Forward pass: compute outputs
  • Compute loss
  • Backward pass: apply the chain rule layer by layer
  • Update parameters

Key idea:
Backprop = chain rule + vectorized gradients

Every neural network training loop is built on this.
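
A compact sketch of that loop for a tiny two-layer network (synthetic data; the variable names and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 2))                 # tiny batch
    y = (X[:, :1] + X[:, 1:] > 0).astype(float)     # made-up binary target

    W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)
    W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)
    lr = 0.1

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    for _ in range(100):
        # forward pass
        h = np.tanh(X @ W1 + b1)
        y_hat = sigmoid(h @ W2 + b2)
        # backward pass: chain rule, layer by layer (sigmoid + cross-entropy)
        d_out = (y_hat - y) / len(X)
        dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
        d_h = (d_out @ W2.T) * (1 - h ** 2)         # tanh'(z) = 1 - tanh(z)^2
        dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
        # update parameters
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2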

 

Power Rule

A basic derivative rule:

Derivative of x^n = n * x^(n - 1)

Why it matters:
  • Appears inside many gradient calculations
  • Used in MSE, polynomial models, and activation derivatives
  • Forms part of the “local gradient” in backprop

The power rule is one of the building blocks of symbolic differentiation.
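
A one-line check with x^3 (plain Python):

    # power rule: d/dx of x^3 is 3 * x^2
    x = 2.0
    analytic = 3 * x ** 2                                 # 12.0
    numeric = ((x + 1e-6) ** 3 - (x - 1e-6) ** 3) / 2e-6  # ~12.0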

 

Activation Functions

Nonlinear functions applied after the linear step.

Examples:
  • Sigmoid
  • ReLU
  • Tanh

Purpose:
  • Allow networks to learn nonlinear patterns
  • Without them, stacking layers collapses into a single linear model

Nonlinearity gives neural networks their expressive power.
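
The three examples above, applied elementwise to a linear output z (toy values):

    import numpy as np

    z = np.array([-2.0, -0.5, 0.0, 1.5])    # a layer's linear output

    relu_out    = np.maximum(0.0, z)         # hinge at zero
    tanh_out    = np.tanh(z)                 # squashes into (-1, 1)
    sigmoid_out = 1 / (1 + np.exp(-z))       # squashes into (0, 1)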

 

Decision Boundary

The set of points where the model is undecided.

For logistic regression:

w*x + b = 0   (equivalently, the predicted probability is exactly 0.5)

Geometry:
  • In 2D → a line
  • In 3D → a plane
  • In higher dimensions → a hyperplane

The weight vector is perpendicular to the boundary.
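
A small sketch of how the boundary splits points (weights chosen by hand for illustration):

    import numpy as np

    w = np.array([2.0, -1.0])
    b = 0.5

    points = np.array([[0.0, 0.0],
                       [1.0, 3.0],
                       [-1.0, -3.0]])

    scores = points @ w + b             # sign tells which side of w*x + b = 0
    labels = (scores > 0).astype(int)
    print(scores, labels)               # a score of exactly 0 sits on the boundary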

 

Linear vs Nonlinear Models

Linear models:
  • Combine features with a weighted sum
  • Can only separate data with a line or plane
  • Fail on XOR

Nonlinear models:

  • Use activation functions to bend space
  • Can separate complex patterns
  • Form the basis of neural networks

XOR is the classic example showing the difference.
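
A tiny demonstration: a single hidden layer makes XOR linearly separable, something no weighted sum of the raw inputs can do. The weights below were chosen by hand for illustration.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])                  # XOR labels

    # Hidden layer bends the space: h1 counts active inputs, h2 fires only for (1, 1)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([0.0, -1.0])
    h = np.maximum(0.0, X @ W1 + b1)            # ReLU activation

    # In (h1, h2) space a single line now separates the classes
    scores = h @ np.array([1.0, -2.0])
    print((scores > 0.5).astype(int))           # [0, 1, 1, 0], matching XOR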

Vectorized Gradients

Instead of computing gradients one weight at a time:

dw = (1/n) * X^T * (y_hat - y)

Benefits:
  • Faster
  • Cleaner
  • Matches neural network math
  • Enables batching and GPU acceleration

Vectorization is the modern way to compute gradients.
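
The same formula in NumPy, here for one logistic-regression step on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = rng.integers(0, 2, size=(100, 1)).astype(float)
    w = np.zeros((3, 1))
    b = 0.0

    y_hat = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid of the linear output

    # gradients for every weight at once
    dw = X.T @ (y_hat - y) / len(X)          # shape (3, 1)
    db = np.mean(y_hat - y)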
 

Batching

Instead of using all data at once:
  • Take a slice of rows from X
  • Compute predictions and gradients on that slice
  • Update weights
  • Repeat

Benefits:

  • Faster
  • More stable
  • Works with large datasets
  • Matches how GPUs operate

Batching is simply “mini design matrices.”
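
A minimal batching loop over the rows of X (the actual update step is elided):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))
    y = rng.standard_normal((1000, 1))
    batch_size = 32

    indices = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        rows = indices[start:start + batch_size]
        X_batch, y_batch = X[rows], y[rows]        # a "mini design matrix"
        # ...compute predictions and gradients on this slice, then update weights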

Summary

These concepts form the foundation of:

  • Linear regression
  • Logistic regression
  • Neural networks
  • Backpropagation
  • Deep learning architectures

Once you understand these ideas, the entire field becomes far less mysterious.
Every model — from a single neuron to a transformer — is built from these same ingredients.
