These notes are designed for final-year undergraduate exams. Each topic follows a consistent structure: introduction, core concepts, compact formulas in LaTeX, DOT diagrams you can render with Graphviz, real-life examples, tips, mistakes, and exam-style summaries. After you review one section, try a quick self-check: can you state the goal, the loss, and one pitfall? If you share your university syllabus or past papers, I can tailor emphasis and add targeted practice.
Supervised learning learns a mapping from inputs to outputs using labeled examples. We choose a model
- Labeled data: pairs
$$(x_i, y_i)$$ . Features$$x$$ , target$$y$$ . - Hypothesis/model:
$$\hat{y} = f_\theta(x)$$ . - Loss and risk: empirical risk
$$\min_\theta \sum_i \ell(f_\theta(x_i), y_i)$$ . - Tasks: regression (continuous
$$y$$ ), classification (categorical$$y$$ ). - Generalization: performance on new data; controlled via regularization, validation, and proper model selection.
digraph G {
graph [rankdir=LR, fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#eef5ff", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
D[label="Labeled Data (x,y)"];
Split[label="Split: Train/Val/Test"];
Train[label="Train: minimize loss"];
Select[label="Model select + regularize"];
Predict[label="Predict on new x"];
Eval[label="Evaluate metrics"];
D -> Split -> Train -> Select -> Predict -> Eval;
}- Regression:
$$y$$ = house price given$$x$$ =area, location, age. - Classification: spam vs ham emails based on word features.
- Empirical risk:
$$\hat{R}(\theta)=\frac{1}{n}\sum_{i=1}^n \ell(f_\theta(x_i), y_i)$$ . - Regularization:
$$\hat{R}(\theta)+\lambda \Omega(\theta)$$ where$$\Omega$$ controls complexity. - Metrics: regression (RMSE, MAE,
$$R^2$$ ); classification (accuracy, precision, recall, F1, ROC-AUC, PR-AUC).
- “R for Real numbers” (regression), “C for Classes” (classification).
- Pipeline mnemonic: PIPE = Problem → Inputs/labels → Predictive model → Evaluation.
- Using accuracy for highly imbalanced classes; prefer precision/recall/F1/PR-AUC.
- Tuning hyperparameters on the test set (leaks performance).
- Learn
$$f_\theta$$ from labeled data by minimizing loss with regularization. - Match metric to task and class balance; use validation for model selection.
Distance-based methods predict using similarity. kNN stores training data; for a query
- Distance metric
$$d(x,x')$$ : Euclidean$$\ell_2$$ , Manhattan$$\ell_1$$ , cosine, Minkowski. - kNN classification: majority vote among nearest
$$k$$ neighbors. - kNN regression: average (or distance-weighted average) of neighbors’
$$y$$ -values. - Non-parametric, “lazy” learning: minimal training; computation happens at query time.
- Feature scaling often required for meaningful distances.
digraph G {
graph [rankdir=LR, fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#eaf7f0", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
Q[label="Query x*"];
Scale[label="Scale features"];
Dist[label="Compute distances"];
Select[label="Pick k nearest"];
Agg[label="Aggregate: vote/mean"];
Pred[label="Output y*"];
Q -> Scale -> Dist -> Select -> Agg -> Pred;
}- Recommending a product by looking at “nearest” customers.
- Estimating house price from prices of most similar houses.
- Euclidean:
$$d(x,x')=\sqrt{\sum_j (x_j-x'_j)^2}$$ . - Weighted kNN: $$\hat{y}(x^)=\frac{\sum_{i\in N_k} w_i y_i}{\sum_{i\in N_k} w_i}$$, with $$w_i=\frac{1}{d(x^,x_i)+\varepsilon}$$.
- Curse of dimensionality: distances become less discriminative as
$$p$$ grows. - Complexity: naive query
$$\mathcal{O}(np)$$ ; use KD-trees/ball trees when feasible.
- “k closest friends decide.” Remember SKM: Scale → choose
$$k$$ → choose Metric.
- Not scaling features; one feature can dominate distance.
- Too small
$$k$$ → noisy; too large$$k$$ → biased toward majority class.
- Simple, strong baseline; key choices:
$$k$$ , metric, scaling, tie-breaking.
Decision trees split the feature space using rules that reduce impurity. They handle nonlinearity and mixed feature types and are easy to interpret.
- Structure: root → internal nodes (tests) → leaves (predictions).
- Splitting criteria (classification): entropy, Gini; (regression): variance/MSE.
- Stopping/pruning: max depth, min samples per leaf, cost-complexity pruning.
- Pros: interpretability, handles missing values (surrogates), no scaling needed. Cons: overfitting without control.
digraph G {
graph [fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#fff1e6", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
Root[label="Root: choose best split\n(Gini/Entropy/MSE)"];
L[label="Left subset"];
R[label="Right subset"];
Leaf1[label="Leaf: predict class/mean"];
Leaf2[label="Leaf: predict class/mean"];
Root -> L -> Leaf1;
Root -> R -> Leaf2;
}- Credit approval using thresholds on income, history.
- Medical triage with simple rule paths.
- Entropy:
$$H(S)=-\sum_c p_c\log_2 p_c$$ . - Gini:
$$G(S)=1-\sum_c p_c^2$$ . - Information gain:
$$IG=H(S)-\sum_i \frac{|S_i|}{|S|}H(S_i)$$ . - Regression impurity: node MSE
$$=\frac{1}{|S|}\sum_{i\in S}(y_i-\bar{y}_S)^2$$ .
- “GE-M”: Gini, Entropy for classification; MSE for regression.
- Build deep, then prune back (for generalization).
- Allowing unconstrained depth → overfitting. Use pruning or depth limits.
- Choosing splits by accuracy instead of impurity in imbalanced data.
- Trees partition space to reduce impurity; interpretability is key. Control complexity to prevent overfitting.
Naive Bayes is a probabilistic classifier applying Bayes’ rule with a conditional independence assumption. It is very effective for high-dimensional sparse data (e.g., text).
- Bayes’ theorem:
$$P(y\mid x)=\frac{P(x\mid y)P(y)}{P(x)}\propto P(x\mid y)P(y)$$ . - Naive assumption:
$$P(x\mid y)=\prod_{j} P(x_j\mid y)$$ . - Variants: Gaussian NB (continuous), Multinomial NB (counts), Bernoulli NB (binary features).
- Smoothing avoids zero probabilities (Laplace/Lidstone).
digraph G {
graph [rankdir=LR, fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#eef5ff", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
Est[label="Estimate P(y), P(x_j|y)"];
Post[label="Compute P(y|x) ∝ P(y)∏ P(x_j|y)"];
Arg[label="Predict argmax_y P(y|x)"];
Est -> Post -> Arg;
}- Spam detection using word counts (Multinomial NB).
- Sentiment analysis with word presence (Bernoulli NB).
- Decision rule:
$$\hat{y}=\arg\max_y \big[P(y)\prod_j P(x_j\mid y)\big]$$ . - Laplace smoothing (Multinomial): $$\hat{\theta}{yi}=\frac{N{yi}+\alpha}{N_{y}+\alpha n}$$, with
$$\alpha=1$$ commonly used. - Gaussian NB:
$$x_j\mid y \sim \mathcal{N}(\mu_{jy},\sigma_{jy}^2)$$ and use class-conditional densities.
- “GMB” variants: Gaussian, Multinomial, Bernoulli. “NB = Naively Break into products.”
- Forgetting smoothing → zeroing entire posteriors.
- Applying Multinomial NB to continuous data without discretization.
- NB = Bayes + independence; fast, strong baseline for text; pick variant by feature type and apply smoothing.
Linear regression models the conditional mean of
- Model:
$$\hat{y}=\beta_0+\sum_{j=1}^p \beta_j x_j = X\beta$$ . - Objective (OLS): $$\min_\beta \sum_{i=1}^n (y_i-\hat{y}i)^2 = \min\beta \lVert y - X\beta \rVert_2^2$$.
- Fitting: normal equations or gradient-based optimization.
- Regularization: Ridge (
$$\ell_2$$ ), Lasso ($$\ell_1$$ ), Elastic Net (mix of$$\ell_1$$ ,$$\ell_2$$ ).
digraph G {
graph [rankdir=LR, fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#eaf7f0", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
Data[label="(X,y)"];
Fit[label="Fit β: minimize MSE"];
Reg[label="Regularize (λ)"];
Pred[label="Predict ŷ=Xβ"];
Data -> Fit -> Reg -> Pred;
}- Forecasting sales from price, ads, season.
- Calibrating sensors with linear response.
- Closed-form (if
$$X^TX$$ invertible):$$\hat{\beta}=(X^TX)^{-1}X^Ty$$ . - Ridge:
$$\min_\beta \lVert y-X\beta \rVert_2^2+\lambda\lVert\beta\rVert_2^2$$ . - Lasso:
$$\min_\beta \lVert y-X\beta \rVert_2^2+\lambda\lVert\beta\rVert_1$$ . - Metrics:
$$\text{RMSE}=\sqrt{\tfrac{1}{n}\sum (y_i-\hat{y}_i)^2}$$ ,$$\text{MAE}=\tfrac{1}{n}\sum |y_i-\hat{y}_i|$$ ,$$R^2=1-\tfrac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}$$ .
- “RLL”: Ridge=$$\ell_2$$ shrinks; Lasso=$$\ell_1$$ makes sparse; Elastic Net=mix.
- Relying only on
$$R^2$$ ; always check residual plots and RMSE/MAE. - Skipping feature scaling before regularization or gradient descent.
- Linear regression minimizes squared error; use regularization to reduce overfitting; evaluate with appropriate metrics.
Logistic regression models
- Score:
$$z=w^Tx+b$$ . Sigmoid:$$\sigma(z)=\frac{1}{1+e^{-z}}$$ . - Probability:
$$P(y=1\mid x)=\sigma(w^Tx+b)$$ . Decision: predict 1 if$$P>\tau$$ (often$$\tau=0.5$$ ). - Loss: cross-entropy
$$-\sum_i [y_i\log p_i+(1-y_i)\log(1-p_i)]$$ . Optimization via gradient descent, Newton, or LBFGS. - Regularization:
$$\ell_2$$ ,$$\ell_1$$ . Multiclass via softmax (see Topic 9).
digraph G {
graph [rankdir=LR, fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#fff1e6", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
X[label="Features x"];
Lin[label="z=w^Tx+b"];
Sig[label="σ(z)=1/(1+e^{-z})"];
Prob[label="P(y=1|x)"];
Th[label="Threshold → class"];
X -> Lin -> Sig -> Prob -> Th;
}- Credit default risk, click-through prediction, disease diagnosis.
- Log-odds:
$$\log\frac{p}{1-p}=w^Tx+b$$ . - Gradient:
$$\nabla_w = X^T(p - y) + 2\lambda w$$ for$$\ell_2$$ -regularized binary case. - Calibration: change threshold
$$\tau$$ to balance precision/recall.
- “Logistic → Log-odds linear in features.”
- Ignoring class imbalance; adjust threshold or use class weights.
- Interpreting coefficients without feature scaling or interaction terms.
- Linear decision boundary with probabilistic output; optimize cross-entropy; tune regularization and threshold.
GLMs extend linear models to non-Gaussian targets by linking the mean
- Components: distribution (from exponential family), linear predictor
$$\eta=X\beta$$ , link$$g$$ . - Examples: Gaussian–identity (linear regression), Bernoulli–logit (logistic), Poisson–log (count data), Gamma–log (positive skewed).
- Estimation: maximum likelihood; often via IRLS (iteratively reweighted least squares).
digraph G {
graph [fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#eef5ff", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
Y[label="Y ~ Exponential family"];
Link[label="Link g(μ)=η=Xβ"];
Fit[label="Fit β via MLE/IRLS"];
Pred[label="Predict: μ=g^{-1}(Xβ)"];
Y -> Link -> Fit -> Pred;
}- Poisson regression for counts (e.g., calls per hour).
- Gamma regression for insurance claim sizes.
- Canonical links: identity (Gaussian), logit (Bernoulli), log (Poisson).
- Deviance compares model fit to saturated model; dispersion affects variance.
- “Three Gs of GLM”: Gaussian/Identity, Bernoulli/Logit, Poisson/Log.
- Using Poisson when variance >> mean (overdispersion); consider Negative Binomial.
- Forgetting to use appropriate link for support (e.g., log for positive outcomes).
- Choose distribution + link to match target; fit via MLE/IRLS; check dispersion and residuals.
SVMs find a decision boundary with maximum margin between classes. With kernels, SVMs model nonlinear boundaries by operating in an implicit feature space.
- Hard-margin (separable): maximize margin subject to
$$y_i(w^Tx_i+b)\ge 1$$ . - Soft-margin: allow slack
$$\xi_i$$ with penalty$$C$$ :$$\min_{w,b,\xi} \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i$$ s.t.$$y_i(w^Tx_i+b)\ge 1-\xi_i,\ \xi_i\ge 0$$ . - Hinge loss:
$$\ell(y,f)=\max(0,1-yf)$$ for$$y\in{-1,+1}$$ . - Kernel trick: use
$$K(x,x')=\langle \phi(x),\phi(x')\rangle$$ to compute in high-dimensional feature spaces without explicit$$\phi$$ . - Common kernels: linear
$$K=x^Tx'$$ , polynomial$$K=(\gamma x^Tx' + r)^d$$ , RBF$$K=\exp(-\gamma\lVert x-x'\rVert^2)$$ .
digraph G {
graph [fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#eaf7f0", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
Data[label="Data (y∈{-1,+1})"];
Margin[label="Maximize margin"];
Slack[label="Slack ξ with C"];
Kernel[label="Kernel K(x,x')"];
Decision[label="Decision sign(w^Tφ(x)+b)"];
Data -> Margin -> Slack -> Kernel -> Decision;
}- Text classification (linear kernel) where features are high-dimensional sparse vectors.
- Image classification with RBF kernel for nonlinear boundaries.
- Primal soft-margin (above); dual leads to support vectors with coefficients
$$\alpha_i$$ . - Decision function (kernelized):
$$f(x)=\text{sign}\big(\sum_i \alpha_i y_i K(x_i,x)+b\big)$$ . - Hyperparameters:
$$C$$ trades margin vs misclassification; RBF$$\gamma$$ controls locality.
- “MSC”: Margin → Slack (
$$C$$ ) → Kernel. High$$C$$ = less slack; low$$C$$ = more slack.
- Not scaling features; SVMs (especially RBF) are scale-sensitive.
- Tuning
$$C$$ /$$\gamma$$ without cross-validation → overfitting.
- SVM maximizes margin; hinge loss drives large margins; kernels provide flexible nonlinearity.
Many problems need more than two classes, structured predictions (sequences, trees), or ranked lists. Extend binary methods to these settings.
- Multiclass strategies:
- One-vs-Rest (OvR): train
$$K$$ binary classifiers; pick class with highest score. - One-vs-One (OvO): train
$$K(K-1)/2$$ pairwise classifiers with voting. - Multinomial (softmax): single model with
$$P(y=c\mid x)=\frac{e^{w_c^Tx}}{\sum_k e^{w_k^Tx}}$$ .
- One-vs-Rest (OvR): train
- Structured outputs: model dependencies among outputs (e.g., CRFs, structured SVMs) for sequences, segmentations.
- Ranking: learn a scoring function
$$s(x)$$ so that relevant items rank higher; pairwise (e.g., RankSVM) or listwise losses. - Metrics: macro/micro-F1 for multiclass; MAP, NDCG for ranking.
digraph G {
graph [fontname=Helvetica];
node [shape=rounded, style=filled, fillcolor="#fff1e6", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
M[label="Multiclass (K)"];
OvR[label="OvR: K binaries"];
OvO[label="OvO: K(K-1)/2"];
Soft[label="Softmax: one model"];
M -> OvR; M -> OvO; M -> Soft;
}- Multiclass: handwritten digit recognition (0–9).
- Structured: part-of-speech tagging, image segmentation.
- Ranking: search engines ordering results, top-N recommendations.
- Softmax probabilities (above); loss: cross-entropy
$$L=-\sum_i \log P(y_i\mid x_i)$$ . - Pairwise ranking constraint: encourage
$$s(x^+) > s(x^-)$$ by a margin. - Inference in structured models uses dynamic programming (e.g., Viterbi for sequences).
- “3M for multiclass”: OvR, OvO, Multinomial (softmax).
- Ranking metrics: “N then M”: NDCG first, MAP second in IR exams.
- Using only accuracy for class-imbalanced multiclass; prefer macro-F1.
- Ignoring output dependencies in structured tasks → suboptimal predictions.
- Extend binary models via OvR/OvO/Softmax; structured methods capture label relations; ranking optimizes order-based metrics.
digraph G {
graph [fontname=Helvetica, rankdir=LR];
node [shape=rounded, style=filled, fillcolor="#f2f7ff", color="#cccccc", fontname=Helvetica];
edge [color="#bbbbbb"];
SL[label="Supervised Learning"];
Reg[label="Regression\n(Linear, kNN, Trees, GLM, SVR)"];
Clf[label="Classification\n(Logistic, NB, SVM, kNN, Trees)"];
Kern[label="Kernels\n(RBF, Poly, Linear)"];
Multi[label="Beyond Binary\n(OvR, OvO, Softmax, Structured, Ranking)"];
SL -> Reg; SL -> Clf; SL -> Kern; SL -> Multi;
}| Algorithm | Task | Objective/Loss | Pros | Cons |
| ------------------- | -------------- | -------------------------------------- | ----------------------------------- | ------------------------------- | ----------------------- |
| kNN | Both | Vote/mean of nearest via
- Trees “GE-M”: Gini, Entropy, MSE.
- Regularization “RLL”: Ridge=$$\ell_2$$, Lasso=$$\ell_1$$, Elastic Net=mix.
- SVM “MSC”: Margin–Slack(
$$C$$ )–Kernel. - Multiclass “3M”: OvR, OvO, Multinomial.
- kNN “SKM”: Scale →
$$k$$ → Metric.
- kNN: define; distance metrics; effect of
$$k$$ ; scaling; time complexity; pros/cons; tiny numeric example; curse of dimensionality. - Decision Trees: impurities (formulas); info gain; stopping/pruning; example split; Gini vs entropy comparison; overfitting control.
- Naive Bayes: Bayes rule; independence assumption; variants (Gaussian/Multinomial/Bernoulli); Laplace smoothing formula; text example; limitations.
- Linear vs Logistic: model forms; losses (MSE vs cross-entropy); metrics; regularization; when to use which; interpretability.
- SVM + Kernels: margin intuition; soft vs hard margin; hinge loss; kernels (RBF, polynomial); roles of
$$C$$ and$$\gamma$$ ; scaling. - Multiclass/Structured/Ranking: OvR vs OvO vs softmax; structured prediction definition; ranking losses and metrics.
- In one sentence: when is softmax preferred over OvR? What’s the trade-off?
- Write the hinge loss and explain when it is zero.
- For Multinomial NB, write the smoothed parameter estimate and explain
$$\alpha$$ . - Why does kNN often degrade in very high dimensions? What can you change?
- scikit-learn: Naive Bayes overview and variants.[1]
- scikit-learn: SVM user guide and SVC API.[2][3]
- Wikipedia: Support Vector Machine primer (geometry, margins).[4]
- Bernoulli/Multinomial NB tutorials and explanations.[5][6]
Note: Use these to cross-check formulas and deepen intuition before exams.
- For each algorithm, can you state: objective/loss, decision rule, two pros, two cons, and one tuning hyperparameter?
- Can you write the key formulas from memory: entropy, Gini, softmax, hinge loss, Ridge/Lasso objectives?
- Have you practiced picking metrics for imbalanced classification? ROC-AUC vs PR-AUC rationale?
If you tell me your course or grade level specifics (university, exam pattern), I can tailor the emphasis, add solved past questions, or generate a focused practice set next.