PHASE 1: INTRODUCTION TO PREDICTIVE ANALYTICS
PHASE 2: SUPERVISED LEARNING - REGRESSION
PHASE 3: SUPERVISED LEARNING - CLASSIFICATION
PHASE 4: UNSUPERVISED LEARNING
PHASE 5: DIMENSIONALITY REDUCTION & NEURAL NETWORKS
PHASE 6: MODEL PERFORMANCE & ENSEMBLE METHODS
Predictive Analytics = Using historical data to predict future outcomes.
Example:
💡 Engineering analogy:
Think of it like signal processing: extracting a meaningful pattern from noisy data.
Machine Learning = Giving a computer the ability to learn patterns from data without explicitly programming the rules.
Traditional Programming:
Input + Rules → Output
Machine Learning:
Input + Output → Model → Predict future Output
🔹 Supervised Learning
(Data has labels)
Example:
🔹 Unsupervised Learning
(No labels)
Example:
🔹 Reinforcement Learning
(Not in your syllabus but good to know)
Agent learns by reward & punishment.
Raw data is messy.
Real-world data contains:
🔹 Handling Missing Data
Options: drop the rows, or impute the missing values (mean / median / mode).
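A minimal sketch of two common options, dropping rows vs imputing the mean, in plain Python (the values are made up; pandas would be typical in practice):

```python
# Toy column with missing values (None marks a missing entry).
data = [20, None, 30, None, 40]

# Option 1: drop the missing entries.
dropped = [x for x in data if x is not None]

# Option 2: impute missing entries with the mean of observed values.
mean = sum(dropped) / len(dropped)
filled = [x if x is not None else mean for x in data]

print(dropped)  # [20, 30, 40]
print(filled)   # [20, 30.0, 30, 30.0, 40]
```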
🔹 Encoding Categorical Variables
Machines understand numbers, not text.
Example:
Gender:
🔹 Normalization
Used when feature scales differ.
Example:
If not normalized → the model becomes biased toward large-scale features.
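A min-max normalization sketch on made-up salary and age columns, rescaling each to [0, 1] so neither dominates by scale alone:

```python
# Min-max normalization: (x - min) / (max - min) for each feature.
salary = [20000, 50000, 80000]
age = [20, 35, 50]

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(min_max(salary))  # [0.0, 0.5, 1.0]
print(min_max(age))     # [0.0, 0.5, 1.0]
```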
EDA (Exploratory Data Analysis) means: understanding the data before modeling.
We use:
Correlation measures: How strongly two variables are related.
Value range: −1 to +1.
Example:
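A stdlib-only Pearson correlation sketch on made-up size/price data (NumPy or pandas would normally do this in one call):

```python
import math

# Pearson correlation: covariance divided by the product of
# the two standard deviations; result lies in [-1, 1].
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sizes = [1000, 1500, 2000]
prices = [20, 30, 40]
print(pearson(sizes, prices))  # ≈ 1.0 (perfectly linear)
```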
This is where math meets data.
Regression = predicting continuous values.
Examples:
Regression is used when output is numerical.
Example:
| House Size | Price |
|---|---|
| 1000 sq ft | 20L |
| 1500 sq ft | 30L |
| 2000 sq ft | 40L |
We want to find a function:
y = f(x)
Where: x = input (house size) and y = output (price).
The most basic regression model.
Formula:
y = mx + c
Where: m = slope, c = intercept.
This equation gives us a straight line.
What Does Slope (m) Mean?
Slope tells: How much y changes when x increases by 1.
Example:
If m = 5
Then for every 1 unit increase in x, y increases by 5.
What is Intercept (c)?
When x = 0
y = c
It is the starting value of y.
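Putting the pieces together: a least-squares fit of y = mx + c to the house-price table from earlier (stdlib only; in practice a library like scikit-learn would be used):

```python
# House-price table: size in sq ft, price in lakhs.
xs = [1000, 1500, 2000]
ys = [20, 30, 40]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept.
m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
c = my - m * mx

print(m, c)          # ≈ 0.02 and ≈ 0.0
print(m * 2500 + c)  # predicted price for 2500 sq ft: ≈ 50.0
```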
Don't memorize blindly. Understand logic.
Now instead of one input, we have many.
Formula:
y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + ...
Example:
Predict house price using:
More realistic model.
Sometimes the relationship is not a straight line.
Example:
Then equation becomes:
y = a + bx + cx²
Still regression. But curve instead of line.
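A small sketch: three points determine y = a + bx + cx² exactly. The toy points here are assumed to lie on y = 1 + 2x + 3x², and we solve for the coefficients directly:

```python
# Toy points assumed to lie on y = 1 + 2x + 3x^2, at x = 0, 1, 2.
(x0, y0), (x1, y1), (x2, y2) = (0, 1), (1, 6), (2, 17)

a = y0                             # at x = 0, y = a
# y1 = a + b + c  and  y2 = a + 2b + 4c, so:
c = ((y2 - a) - 2 * (y1 - a)) / 2
b = (y1 - a) - c

print(a, b, c)  # 1 2.0 3.0
```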
Despite the name "regression", logistic regression is used for classification.
Used for:
Output range: 0 to 1
Uses sigmoid function:
σ(z) = 1 / (1 + e⁻ᶻ)
Graph looks like S-curve.
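A quick check of the sigmoid's behaviour at a few points:

```python
import math

# Sigmoid squashes any real number into (0, 1).
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5 (the midpoint of the S-curve)
print(sigmoid(5))   # close to 1
print(sigmoid(-5))  # close to 0
```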
Core idea:
Find line that minimizes squared error.
Error = actual - predicted
OLS minimizes:
Σ(y - ŷ)²
Why square? Squaring keeps errors positive so they don't cancel out, and penalizes large errors more.
🔹 MAE (Mean Absolute Error)
MAE = (1/n) Σ |y - ŷ|
Simple average of absolute errors.
🔹 MSE (Mean Squared Error)
MSE = (1/n) Σ(y - ŷ)²
Punishes large errors more.
🔹 RMSE
RMSE = √MSE
Brings error back to original unit.
🔹 R² (Coefficient of Determination)
Tells how much variance model explains.
Range: usually 0 to 1 (higher is better; it can be negative for very poor models).
Formula:
R² = 1 - (SS_res / SS_tot)
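All four metrics computed on made-up actual/predicted values:

```python
import math

# Made-up actual values and model predictions.
y     = [20, 30, 40]
y_hat = [22, 29, 41]

n = len(y)
mae  = sum(abs(a - p) for a, p in zip(y, y_hat)) / n       # mean absolute error
mse  = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / n     # mean squared error
rmse = math.sqrt(mse)                                      # back to original units

mean_y = sum(y) / n
ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
ss_tot = sum((a - mean_y) ** 2 for a in y)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)  # ≈ 1.33, 2.0, ≈ 1.41, 0.97
```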
Correlation: measures how strongly two variables move together.
Regression: models the relationship so one variable can be predicted from the other.
Correlation ≠ Causation
This phase decides:
Classification = predicting categories.
Classification predicts discrete labels, not continuous values.
Examples:
Unlike regression:
🔹 Binary Classification
Two classes: e.g. spam / not spam.
🔹 Multi-class Classification
More than two classes: e.g. cat / dog / bird.
Now we move to algorithms in your syllabus.
One of the simplest algorithms.
Core Idea:
To classify a new data point: look at its K nearest neighbours and take a majority vote.
How KNN Works
Euclidean Distance Formula
d = √((x₁ - x₂)² + (y₁ - y₂)²)
Important: KNN is sensitive to feature scale, so normalize features first.
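A minimal KNN sketch with majority voting over Euclidean distance (the 2-D training points and their labels are made up):

```python
import math
from collections import Counter

# Toy 2-D training set: two well-separated clusters, labels A and B.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
         ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]

def knn_predict(point, train, k=3):
    # Sort training points by Euclidean distance to the query point,
    # then take a majority vote among the k nearest labels.
    nearest = sorted(train, key=lambda t: math.dist(point, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2, 1), train))  # A (near the first cluster)
print(knn_predict((6, 5), train))  # B (near the second cluster)
```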
Based on Bayes Theorem.
P(A|B) = P(B|A)P(A) / P(B)
Used heavily in: spam filtering and text classification.
Why "Naive"?
Because it assumes features are independent.
Which is rarely true in real life.
But works surprisingly well.
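Plugging assumed toy probabilities into Bayes' theorem, spam-filter flavoured (all numbers invented for illustration):

```python
# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_word_given_spam = 0.6   # assumed: 60% of spam contains "free"
p_spam = 0.2              # assumed: 20% of mail is spam
# Total probability of seeing "free": over spam and non-spam mail.
p_word = 0.6 * 0.2 + 0.05 * 0.8

p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # ≈ 0.75
```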
One of the most intuitive models.
It splits data based on conditions.
Example:
If age > 25 → then check salary → then classify
How It Decides Splits
Uses: Gini impurity or information gain (entropy).
Problem:
Decision trees can overfit easily.
One of the most powerful classifiers.
Core idea:
Find a hyperplane that separates classes with maximum margin.
Key Idea:
Closest points = Support Vectors.
Kernel Trick (Advanced Concept)
Used when data is not linearly separable.
Common kernels: linear, polynomial, RBF.
After building classifier, we evaluate it.
Confusion matrix looks like this:
| | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP | FN |
| Actual No | FP | TN |
🔹 Accuracy
Accuracy = (TP + TN) / Total
But accuracy can be misleading.
Example:
If 95% data is negative,
Predicting always negative gives 95% accuracy.
But model is useless.
🔹 Precision
Precision = TP / (TP + FP)
How many predicted positives are correct?
Important in: spam detection, where false positives are costly.
🔹 Recall
Recall = TP / (TP + FN)
How many actual positives did we detect?
Important in: medical diagnosis, where missing an actual positive is costly.
🔹 F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Balances precision & recall.
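The formulas above computed from assumed toy confusion-matrix counts:

```python
# Assumed toy counts: true/false positives and negatives.
tp, fn, fp, tn = 40, 10, 5, 45

total = tp + fn + fp + tn
accuracy  = (tp + tn) / total
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.85, ≈ 0.889, 0.8, ≈ 0.842
```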
🔹 AUC & ROC Curve
ROC curve plots: True Positive Rate vs False Positive Rate across classification thresholds.
AUC measures the area under the ROC curve.
Higher AUC → better model.
This is where:
In supervised learning:
Data = Input + Output (labels)
In unsupervised learning:
Data = Only Input
Model must find: structure on its own (groups, patterns, associations).
Real Life Examples
Clustering = grouping similar data points together.
Goal: points within a cluster are similar; points in different clusters are dissimilar.
One of the most popular clustering algorithms.
Core Idea: choose K centroids, assign each point to its nearest centroid, move each centroid to the mean of its cluster, and repeat until stable.
How Is Distance Measured?
Usually Euclidean distance:
d = √((x₁ - x₂)² + (y₁ - y₂)²)
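A compact K-Means sketch on assumed toy 2-D data (K fixed at 2, simple deterministic initialization): assign each point to its nearest centroid, move each centroid to its cluster mean, repeat:

```python
import math

# Two obvious blobs of made-up 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
k = 2
centroids = [points[0], points[3]]  # simple deterministic init

for _ in range(10):
    # Assignment step: each point joins its nearest centroid.
    clusters = [[] for _ in range(k)]
    for p in points:
        i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        clusters[i].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]

print(centroids)  # one centre near each blob
```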
Choosing wrong K = wrong clustering.
🔹 Elbow Method
Plot: number of clusters (K) vs within-cluster sum of squares (WCSS).
Look for elbow point.
Instead of choosing K initially, it builds a hierarchy of clusters.
Two types:
🔹 Agglomerative (Bottom-up): start with every point as its own cluster and merge.
🔹 Divisive (Top-down): start with one big cluster and split.
Linkage Methods
How clusters are merged: single, complete, or average linkage.
Each affects shape of clusters.
Used in: Market Basket Analysis.
Example:
Customers who buy bread also buy butter.
🔹 Important Terms
Support
How frequently item appears.
Support(A) = Transactions containing A / Total Transactions
Confidence
Probability of buying B given A.
Confidence(A → B) = Support(A,B) / Support(A)
Lift
Measures strength of rule: Lift(A → B) = Confidence(A → B) / Support(B)
If Lift > 1 → positive association.
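Support, confidence, and lift computed on a made-up transaction list:

```python
# Made-up market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions containing all the given items.
    return sum(items <= t for t in transactions) / n

sup_a  = support({"bread"})
sup_ab = support({"bread", "butter"})
confidence = sup_ab / sup_a                  # P(butter | bread)
lift = confidence / support({"butter"})      # > 1 → positive association

print(sup_a, sup_ab, confidence, lift)  # 0.75, 0.5, ≈ 0.667, ≈ 1.33
```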
Used to find frequent itemsets.
Steps:
This phase is where mathematics + optimization + AI meet.
Imagine dataset with:
Problems:
More dimensions ≠ better model.
Sometimes: More dimensions = more noise.
As dimensions increase:
This is why reducing dimensions helps.
Most important dimensionality reduction technique.
Core Idea:
Convert correlated variables into fewer uncorrelated variables.
These new variables are called: Principal Components.
What PCA Does
Instead of original features:
X1, X2, X3, X4…
PCA creates:
PC1, PC2, PC3…
Where: PC1 captures the most variance, PC2 the next most, and so on.
You don't need to derive eigenvalues in exam. Understand logic.
Each principal component explains some variance.
Example: suppose PC1 and PC2 together explain 85% of the variance.
So you can reduce 10 features to 2 features while retaining 85% information.
Now we enter AI territory.
Inspired by human brain.
Basic unit = Neuron.
Each neuron: takes weighted inputs, adds a bias, and applies an activation function.
Mathematical form:
Output = Activation(w₁x₁ + w₂x₂ + b)
Data flows:
Input → Hidden Layer → Output
No cycles.
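A single-neuron forward pass matching the formula above, with made-up weights and a sigmoid activation:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [1.0, 2.0]     # inputs (made up)
w = [0.5, -0.25]   # weights (made up)
b = 0.1            # bias (made up)

# Output = Activation(w1*x1 + w2*x2 + b)
z = sum(wi * xi for wi, xi in zip(w, x)) + b
output = sigmoid(z)
print(output)  # sigmoid(0.1) ≈ 0.525
```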
Used for:
An MLP (Multi-Layer Perceptron) has: an input layer, one or more hidden layers, and an output layer.
This is what people usually mean by "Neural Network".
Activation decides output behavior.
Common ones:
Sigmoid
ReLU
f(x) = max(0, x)
Used in:
Instead of normal neurons, CNN uses filters to detect patterns.
Example:
Used for sequential data:
RNN remembers previous information.
Unlike feedforward networks, it has memory.
Because:
Some patterns are too complex for:
Neural networks handle:
This phase separates beginners from serious data professionals.
This is the heart of machine learning.
Bias = error due to wrong assumptions.
Example:
Fitting a straight line to curved data.
Model is too simple.
Result:
👉 Underfitting
Variance = error due to the model being too sensitive to the training data.
Example:
Model memorizes training data exactly.
Result:
👉 Overfitting
Underfitting: high bias → poor on both training and test data.
Overfitting: high variance → great on training data, poor on test data.
Goal:
👉 Balance both
Training accuracy alone is dangerous.
We need to test model reliability.
Basic method: split the data into a training set and a test set (e.g. 80/20).
But there is a problem: the result depends on the random split.
Process: split the data into K folds; each fold takes one turn as the test set while the remaining folds train the model; average the K scores.
Extreme case of K-fold: Leave-One-Out Cross-Validation (LOOCV), where K = number of samples.
Very expensive computationally.
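A minimal K-fold split sketch (K assumed to be 3) over nine sample indices; each fold takes one turn as the test set:

```python
# Nine samples, represented by their indices.
data = list(range(9))
k = 3
fold_size = len(data) // k

folds = []
for i in range(k):
    # Fold i is the test set; everything else is training data.
    test = data[i * fold_size : (i + 1) * fold_size]
    train = data[: i * fold_size] + data[(i + 1) * fold_size :]
    folds.append((train, test))
    print(f"fold {i}: test={test}")
```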
Ensemble = combining multiple models.
Core Idea:
Many weak models together become strong.
Example: Random Forest.
Process: train each model on a random bootstrap sample of the data, then aggregate their predictions.
Reduces:
👉 Variance
Most popular ensemble algorithm.
It is:
👉 Many decision trees combined.
Each tree: trains on a random subset of rows and features.
Final output: majority vote (classification) or average (regression).
Why powerful? The individual trees' errors tend to cancel out.
Unlike bagging, models are trained sequentially.
Each new model:
Focuses on correcting previous errors.
Popular Boosting Methods: AdaBoost, Gradient Boosting, XGBoost.
| Bagging | Boosting |
|---|---|
| Reduces variance | Reduces bias |
| Parallel training | Sequential training |
| Example: Random Forest | Example: AdaBoost |