PHASE 1: INTRODUCTION TO PREDICTIVE ANALYTICS
PHASE 2: SUPERVISED LEARNING - REGRESSION
PHASE 3: SUPERVISED LEARNING - CLASSIFICATION
PHASE 4: UNSUPERVISED LEARNING
PHASE 5: DIMENSIONALITY REDUCTION & NEURAL NETWORKS
PHASE 6: MODEL PERFORMANCE & ENSEMBLE METHODS
Predictive Analytics = Using historical data to predict future outcomes.
Example:
💡 Engineering analogy:
Think of it like signal processing: extracting a meaningful pattern from noisy data.
Machine Learning = Giving a computer the ability to learn patterns from data without explicitly programming the rules.
Traditional Programming:
Input + Rules → Output
Machine Learning:
Input + Output → Model → Predict future Output
🔹 Supervised Learning
(Data has labels)
Example:
🔹 Unsupervised Learning
(No labels)
Example:
🔹 Reinforcement Learning
(Not in your syllabus but good to know)
Agent learns by reward & punishment.
Raw data is messy.
Real-world data contains:
🔹 Handling Missing Data
Options: drop the rows, or impute the missing values (mean / median / mode).
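A minimal sketch of two common options, dropping rows vs imputing the mean, in plain Python (the values are made up; pandas would be typical in practice):

```python
# Toy column with missing values (None marks a missing entry).
data = [20, None, 30, None, 40]

# Option 1: drop the missing entries.
dropped = [x for x in data if x is not None]

# Option 2: impute missing entries with the mean of observed values.
mean = sum(dropped) / len(dropped)
filled = [x if x is not None else mean for x in data]

print(dropped)  # [20, 30, 40]
print(filled)   # [20, 30.0, 30, 30.0, 40]
```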
🔹 Encoding Categorical Variables
Machines understand numbers, not text.
Example:
Gender:
🔹 Normalization
Used when feature scales differ.
Example:
If not normalized → the model becomes biased toward large-scale features.
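A min-max normalization sketch on made-up salary and age columns, rescaling each to [0, 1] so neither dominates by scale alone:

```python
# Min-max normalization: (x - min) / (max - min) for each feature.
salary = [20000, 50000, 80000]
age = [20, 35, 50]

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(min_max(salary))  # [0.0, 0.5, 1.0]
print(min_max(age))     # [0.0, 0.5, 1.0]
```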
EDA (Exploratory Data Analysis) means: understanding the data before modeling.
We use:
Correlation measures: How strongly two variables are related.
Value range: −1 to +1.
Example:
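A stdlib-only Pearson correlation sketch on made-up size/price data (NumPy or pandas would normally do this in one call):

```python
import math

# Pearson correlation: covariance divided by the product of
# the two standard deviations; result lies in [-1, 1].
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sizes = [1000, 1500, 2000]
prices = [20, 30, 40]
print(pearson(sizes, prices))  # ≈ 1.0 (perfectly linear)
```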
This is where math meets data.
Regression = predicting continuous values.
Examples:
Regression is used when output is numerical.
Example:
| House Size | Price |
|---|---|
| 1000 sq ft | 20L |
| 1500 sq ft | 30L |
| 2000 sq ft | 40L |
We want to find a function:
y = f(x)
Where: x = input (house size) and y = output (price).
The most basic regression model.
Formula:
y = mx + c
Where: m = slope, c = intercept.
This equation gives us a straight line.
What Does Slope (m) Mean?
Slope tells: How much y changes when x increases by 1.
Example:
If m = 5
Then for every 1 unit increase in x, y increases by 5.
What is Intercept (c)?
When x = 0
y = c
It is the starting value of y.
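Putting the pieces together: a least-squares fit of y = mx + c to the house-price table from earlier (stdlib only; in practice a library like scikit-learn would be used):

```python
# House-price table: size in sq ft, price in lakhs.
xs = [1000, 1500, 2000]
ys = [20, 30, 40]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept.
m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
c = my - m * mx

print(m, c)          # ≈ 0.02 and ≈ 0.0
print(m * 2500 + c)  # predicted price for 2500 sq ft: ≈ 50.0
```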
Don't memorize blindly. Understand logic.
Now instead of one input, we have many.
Formula:
y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + ...
Example:
Predict house price using:
More realistic model.
Sometimes the relationship is not a straight line.
Example:
Then equation becomes:
y = a + bx + cx²
Still regression. But curve instead of line.
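A small sketch: three points determine y = a + bx + cx² exactly. The toy points here are assumed to lie on y = 1 + 2x + 3x², and we solve for the coefficients directly:

```python
# Toy points assumed to lie on y = 1 + 2x + 3x^2, at x = 0, 1, 2.
(x0, y0), (x1, y1), (x2, y2) = (0, 1), (1, 6), (2, 17)

a = y0                             # at x = 0, y = a
# y1 = a + b + c  and  y2 = a + 2b + 4c, so:
c = ((y2 - a) - 2 * (y1 - a)) / 2
b = (y1 - a) - c

print(a, b, c)  # 1 2.0 3.0
```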
Despite the name "regression", logistic regression is used for classification.
Used for:
Output range: 0 to 1
Uses sigmoid function:
σ(z) = 1 / (1 + e⁻ᶻ)
Graph looks like S-curve.
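A quick check of the sigmoid's behaviour at a few points:

```python
import math

# Sigmoid squashes any real number into (0, 1).
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5 (the midpoint of the S-curve)
print(sigmoid(5))   # close to 1
print(sigmoid(-5))  # close to 0
```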
Core idea:
Find line that minimizes squared error.
Error = actual - predicted
OLS minimizes:
Σ(y - ŷ)²
Why square? Squaring keeps errors positive so they don't cancel out, and penalizes large errors more.
🔹 MAE (Mean Absolute Error)
MAE = (1/n) Σ |y - ŷ|
Simple average of absolute errors.
🔹 MSE (Mean Squared Error)
MSE = (1/n) Σ(y - ŷ)²
Punishes large errors more.
🔹 RMSE
RMSE = √MSE
Brings error back to original unit.
🔹 R² (Coefficient of Determination)
Tells how much variance model explains.
Range: usually 0 to 1 (higher is better; it can be negative for very poor models).
Formula:
R² = 1 - (SS_res / SS_tot)
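All four metrics computed on made-up actual/predicted values:

```python
import math

# Made-up actual values and model predictions.
y     = [20, 30, 40]
y_hat = [22, 29, 41]

n = len(y)
mae  = sum(abs(a - p) for a, p in zip(y, y_hat)) / n       # mean absolute error
mse  = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / n     # mean squared error
rmse = math.sqrt(mse)                                      # back to original units

mean_y = sum(y) / n
ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
ss_tot = sum((a - mean_y) ** 2 for a in y)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)  # ≈ 1.33, 2.0, ≈ 1.41, 0.97
```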
Correlation: measures how strongly two variables move together.
Regression: models the relationship so one variable can be predicted from the other.
Correlation ≠ Causation
This phase decides:
Classification = predicting categories.
Classification predicts discrete labels, not continuous values.
Examples:
Unlike regression:
🔹 Binary Classification
Two classes: e.g. spam / not spam.
🔹 Multi-class Classification
More than two classes: e.g. cat / dog / bird.
Now we move to algorithms in your syllabus.
One of the simplest algorithms.
Core Idea:
To classify a new data point: look at its K nearest neighbours and take a majority vote.
How KNN Works
Euclidean Distance Formula
d = √((x₁ - x₂)² + (y₁ - y₂)²)
Important: KNN is sensitive to feature scale, so normalize features first.
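A minimal KNN sketch with majority voting over Euclidean distance (the 2-D training points and their labels are made up):

```python
import math
from collections import Counter

# Toy 2-D training set: two well-separated clusters, labels A and B.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
         ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]

def knn_predict(point, train, k=3):
    # Sort training points by Euclidean distance to the query point,
    # then take a majority vote among the k nearest labels.
    nearest = sorted(train, key=lambda t: math.dist(point, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2, 1), train))  # A (near the first cluster)
print(knn_predict((6, 5), train))  # B (near the second cluster)
```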
Based on Bayes Theorem.
P(A|B) = P(B|A)P(A) / P(B)
Used heavily in: spam filtering and text classification.
Why "Naive"?
Because it assumes features are independent.
Which is rarely true in real life.
But works surprisingly well.
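Plugging assumed toy probabilities into Bayes' theorem, spam-filter flavoured (all numbers invented for illustration):

```python
# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_word_given_spam = 0.6   # assumed: 60% of spam contains "free"
p_spam = 0.2              # assumed: 20% of mail is spam
# Total probability of seeing "free": over spam and non-spam mail.
p_word = 0.6 * 0.2 + 0.05 * 0.8

p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # ≈ 0.75
```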
One of the most intuitive models.
It splits data based on conditions.
Example:
If age > 25 → then check salary → then classify
How It Decides Splits
Uses: Gini impurity or information gain (entropy).
Problem:
Decision trees can overfit easily.
One of the most powerful classifiers.
Core idea:
Find a hyperplane that separates classes with maximum margin.
Key Idea:
Closest points = Support Vectors.
Kernel Trick (Advanced Concept)
Used when data is not linearly separable.
Common kernels: linear, polynomial, RBF.
After building classifier, we evaluate it.
Confusion matrix looks like this:
| | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP | FN |
| Actual No | FP | TN |
🔹 Accuracy
Accuracy = (TP + TN) / Total
But accuracy can be misleading.
Example:
If 95% data is negative,
Predicting always negative gives 95% accuracy.
But model is useless.
🔹 Precision
Precision = TP / (TP + FP)
How many predicted positives are correct?
Important in: spam detection, where false positives are costly.
🔹 Recall
Recall = TP / (TP + FN)
How many actual positives did we detect?
Important in: medical diagnosis, where missing an actual positive is costly.
🔹 F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Balances precision & recall.
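The formulas above computed from assumed toy confusion-matrix counts:

```python
# Assumed toy counts: true/false positives and negatives.
tp, fn, fp, tn = 40, 10, 5, 45

total = tp + fn + fp + tn
accuracy  = (tp + tn) / total
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.85, ≈ 0.889, 0.8, ≈ 0.842
```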
🔹 AUC & ROC Curve
ROC curve plots: True Positive Rate vs False Positive Rate across classification thresholds.
AUC measures the area under the ROC curve.
Higher AUC → better model.
This is where:
In supervised learning:
Data = Input + Output (labels)
In unsupervised learning:
Data = Only Input
Model must find: structure on its own (groups, patterns, associations).
Real Life Examples
Clustering = grouping similar data points together.
Goal: points within a cluster are similar; points in different clusters are dissimilar.
One of the most popular clustering algorithms.
Core Idea: choose K centroids, assign each point to its nearest centroid, move each centroid to the mean of its cluster, and repeat until stable.
How Is Distance Measured?
Usually Euclidean distance:
d = √((x₁ - x₂)² + (y₁ - y₂)²)
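A compact K-Means sketch on assumed toy 2-D data (K fixed at 2, simple deterministic initialization): assign each point to its nearest centroid, move each centroid to its cluster mean, repeat:

```python
import math

# Two obvious blobs of made-up 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
k = 2
centroids = [points[0], points[3]]  # simple deterministic init

for _ in range(10):
    # Assignment step: each point joins its nearest centroid.
    clusters = [[] for _ in range(k)]
    for p in points:
        i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        clusters[i].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]

print(centroids)  # one centre near each blob
```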
Choosing wrong K = wrong clustering.
🔹 Elbow Method
Plot: number of clusters (K) vs within-cluster sum of squares (WCSS).
Look for elbow point.
Instead of choosing K initially, it builds a hierarchy of clusters.
Two types:
🔹 Agglomerative (Bottom-up): start with every point as its own cluster and merge.
🔹 Divisive (Top-down): start with one big cluster and split.
Linkage Methods
How clusters are merged: single, complete, or average linkage.
Each affects shape of clusters.
Used in: Market Basket Analysis.
Example:
Customers who buy bread also buy butter.
🔹 Important Terms
Support
How frequently item appears.
Support(A) = Transactions containing A / Total Transactions
Confidence
Probability of buying B given A.
Confidence(A → B) = Support(A,B) / Support(A)
Lift
Measures strength of rule: Lift(A → B) = Confidence(A → B) / Support(B)
If Lift > 1 → positive association.
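Support, confidence, and lift computed on a made-up transaction list:

```python
# Made-up market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions containing all the given items.
    return sum(items <= t for t in transactions) / n

sup_a  = support({"bread"})
sup_ab = support({"bread", "butter"})
confidence = sup_ab / sup_a                  # P(butter | bread)
lift = confidence / support({"butter"})      # > 1 → positive association

print(sup_a, sup_ab, confidence, lift)  # 0.75, 0.5, ≈ 0.667, ≈ 1.33
```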
Used to find frequent itemsets.
Steps:
This phase is where mathematics + optimization + AI meet.
Imagine dataset with:
Problems:
More dimensions ≠ better model.
Sometimes: More dimensions = more noise.
As dimensions increase:
This is why reducing dimensions helps.
Most important dimensionality reduction technique.
Core Idea:
Convert correlated variables into fewer uncorrelated variables.
These new variables are called: Principal Components.
What PCA Does
Instead of original features:
X1, X2, X3, X4…
PCA creates:
PC1, PC2, PC3…
Where: PC1 captures the most variance, PC2 the next most, and so on.
You don't need to derive eigenvalues in exam. Understand logic.
Each principal component explains some variance.
Example: suppose PC1 and PC2 together explain 85% of the variance.
So you can reduce 10 features to 2 features while retaining 85% information.
Now we enter AI territory.
Inspired by human brain.
Basic unit = Neuron.
Each neuron: takes weighted inputs, adds a bias, and applies an activation function.
Mathematical form:
Output = Activation(w₁x₁ + w₂x₂ + b)
Data flows:
Input → Hidden Layer → Output
No cycles.
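A single-neuron forward pass matching the formula above, with made-up weights and a sigmoid activation:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [1.0, 2.0]     # inputs (made up)
w = [0.5, -0.25]   # weights (made up)
b = 0.1            # bias (made up)

# Output = Activation(w1*x1 + w2*x2 + b)
z = sum(wi * xi for wi, xi in zip(w, x)) + b
output = sigmoid(z)
print(output)  # sigmoid(0.1) ≈ 0.525
```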
Used for:
An MLP (Multi-Layer Perceptron) has: an input layer, one or more hidden layers, and an output layer.
This is what people usually mean by "Neural Network".
Activation decides output behavior.
Common ones:
Sigmoid
ReLU
f(x) = max(0, x)
Used in:
Instead of normal neurons, CNN uses filters to detect patterns.
Example:
Used for sequential data:
RNN remembers previous information.
Unlike feedforward networks, it has memory.
Because:
Some patterns are too complex for:
Neural networks handle:
This phase separates beginners from serious data professionals.
This is the heart of machine learning.
Bias = error due to wrong assumptions.
Example:
Fitting a straight line to curved data.
Model is too simple.
Result:
👉 Underfitting
Variance = error due to the model being too sensitive to the training data.
Example:
Model memorizes training data exactly.
Result:
👉 Overfitting
Underfitting: high bias → poor on both training and test data.
Overfitting: high variance → great on training data, poor on test data.
Goal:
👉 Balance both
Training accuracy alone is dangerous.
We need to test model reliability.
Basic method: split the data into a training set and a test set (e.g. 80/20).
But there is a problem: the result depends on the random split.
Process: split the data into K folds; each fold takes one turn as the test set while the remaining folds train the model; average the K scores.
Extreme case of K-fold: Leave-One-Out Cross-Validation (LOOCV), where K = number of samples.
Very expensive computationally.
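A minimal K-fold split sketch (K assumed to be 3) over nine sample indices; each fold takes one turn as the test set:

```python
# Nine samples, represented by their indices.
data = list(range(9))
k = 3
fold_size = len(data) // k

folds = []
for i in range(k):
    # Fold i is the test set; everything else is training data.
    test = data[i * fold_size : (i + 1) * fold_size]
    train = data[: i * fold_size] + data[(i + 1) * fold_size :]
    folds.append((train, test))
    print(f"fold {i}: test={test}")
```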
Ensemble = combining multiple models.
Core Idea:
Many weak models together become strong.
Example: Random Forest.
Process: train each model on a random bootstrap sample of the data, then aggregate their predictions.
Reduces:
👉 Variance
Most popular ensemble algorithm.
It is:
👉 Many decision trees combined.
Each tree: trains on a random subset of rows and features.
Final output: majority vote (classification) or average (regression).
Why powerful? The individual trees' errors tend to cancel out.
Unlike bagging, models are trained sequentially.
Each new model:
Focuses on correcting previous errors.
Popular Boosting Methods: AdaBoost, Gradient Boosting, XGBoost.
| Bagging | Boosting |
|---|---|
| Reduces variance | Reduces bias |
| Parallel training | Sequential training |
| Example: Random Forest | Example: AdaBoost |