My Notes

Predictive Analytics Notes

Table of Contents

PHASE 1: INTRODUCTION TO PREDICTIVE ANALYTICS


PHASE 2: SUPERVISED LEARNING - REGRESSION


PHASE 3: SUPERVISED LEARNING - CLASSIFICATION


PHASE 4: UNSUPERVISED LEARNING


PHASE 5: DIMENSIONALITY REDUCTION & NEURAL NETWORKS


PHASE 6: MODEL PERFORMANCE & ENSEMBLE METHODS


🔵 PHASE 1: INTRODUCTION TO PREDICTIVE ANALYTICS

1️⃣ What is Predictive Analytics?

Predictive Analytics = Using historical data to predict future outcomes.

Example:

  • Predict house price
  • Predict student performance
  • Predict loan default
  • Predict customer churn

💡 Engineering analogy:

Think of it like this:

  • You observe system behavior over time
  • You detect a pattern
  • You create an equation
  • You predict next value

Just like signal processing.

2️⃣ What is Machine Learning?

Machine Learning = Giving a computer the ability to learn patterns from data without explicitly programming the rules.

Traditional Programming:

Input + Rules → Output

Machine Learning:

Input + Output → Model → Predict future Output

3️⃣ Types of Machine Learning (From Your Syllabus)

🔹 Supervised Learning

(Data has labels)

Example:

  • House size → price
  • Email → spam/not spam

🔹 Unsupervised Learning

(No labels)

Example:

  • Customer segmentation
  • Grouping similar products

🔹 Reinforcement Learning

(Not in your syllabus but good to know)

Agent learns by reward & punishment.

4️⃣ Data Preprocessing (VERY IMPORTANT – 40% of real work)

Raw data is messy.

Real-world data contains:

  • Missing values
  • Outliers
  • Noise
  • Inconsistent format

🔹 Handling Missing Data

Options:

  • Remove rows
  • Fill with mean
  • Fill with median
  • Fill with mode
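A small pandas sketch of these options (the dataset and its numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (hypothetical numbers)
df = pd.DataFrame({"age": [22, 25, np.nan, 30],
                   "salary": [20000, 25000, 27000, np.nan]})

dropped = df.dropna()                   # option 1: remove rows with missing values
filled_mean = df.fillna(df.mean())      # option 2: fill with column mean
filled_median = df.fillna(df.median())  # option 3: fill with column median
# For a categorical column, df[col].fillna(df[col].mode()[0]) fills with the mode
```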

🔹 Encoding Categorical Variables

Machines understand numbers, not text.

Example:

Gender:

  • Male → 0
  • Female → 1

🔹 Normalization

Used when scale differs.

Example:

  • Age → 18 to 60
  • Salary → 10,000 to 5,00,000

If features are not normalized, those on larger scales dominate and the model becomes biased.
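A minimal min–max normalization sketch in NumPy (the age and salary values are hypothetical):

```python
import numpy as np

age = np.array([18.0, 30.0, 45.0, 60.0])
salary = np.array([10_000.0, 80_000.0, 200_000.0, 500_000.0])

def min_max(x):
    # Rescale values to the [0, 1] range
    return (x - x.min()) / (x.max() - x.min())

age_scaled = min_max(age)
salary_scaled = min_max(salary)
# Both features now live on the same 0-1 scale, so neither dominates
```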

5️⃣ Exploratory Data Analysis (EDA)

EDA means: Understanding data before modeling.

We use:

  • Mean
  • Median
  • Standard deviation
  • Correlation
  • Histograms
  • Boxplots

6️⃣ Correlation (Very Important for Regression Later)

Correlation measures: How strongly two variables are related.

Value range:

  • +1 → perfect positive
  • 0 → no relation
  • -1 → perfect negative

Example:

  • Study hours vs marks → positive
  • Speed vs time → negative
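Correlation can be checked directly with NumPy; the marks below are made-up numbers for illustration:

```python
import numpy as np

study_hours = np.array([1, 2, 3, 4, 5])
marks = np.array([35, 50, 62, 71, 88])  # hypothetical marks

# Pearson correlation coefficient between the two variables
r = np.corrcoef(study_hours, marks)[0, 1]
# r near +1 indicates a strong positive relationship
```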

🔵 PHASE 2: SUPERVISED LEARNING – REGRESSION

This is where math meets data.

Regression = predicting continuous values.

Examples:

  • House price
  • Temperature
  • Sales forecast
  • Stock value

1️⃣ What is Regression?

Regression is used when output is numerical.

Example:

House Size      Price
1000 sq ft      20L
1500 sq ft      30L
2000 sq ft      40L

We want to find a function:

y = f(x)

Where:

  • x = input
  • y = output

2️⃣ Simple Linear Regression (SLR)

The most basic regression model.

Formula:

y = mx + c

Where:

  • m = slope
  • c = intercept

This equation gives us a straight line.

What Does Slope (m) Mean?

Slope tells: How much y changes when x increases by 1.

Example:

If m = 5

Then for every 1 unit increase in x, y increases by 5.

What is Intercept (c)?

When x = 0

y = c

It is the starting value.
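Using the house-price table from earlier, a quick fit with scikit-learn (assuming `sklearn` is installed; the data is perfectly linear, so the line passes through every point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data from the table: size (sq ft) vs price (lakhs)
X = np.array([[1000], [1500], [2000]])
y = np.array([20.0, 30.0, 40.0])

model = LinearRegression().fit(X, y)
m = model.coef_[0]     # slope: price change per extra sq ft
c = model.intercept_   # intercept: predicted value at x = 0
pred = model.predict(np.array([[1750]]))[0]
```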

3️⃣ Assumptions of Linear Regression (Interview Important)

  • Linearity
  • Independence
  • Homoscedasticity
  • Normal distribution of errors

Don't memorize blindly. Understand logic.

4️⃣ Multiple Linear Regression (MLR)

Now instead of one input, we have many.

Formula:

y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + ...

Example:

Predict house price using:

  • Size
  • Bedrooms
  • Location score

More realistic model.

5️⃣ Polynomial Regression

Sometimes the relationship is not a straight line.

Example:

  • Growth curves
  • Engineering stress-strain curves

Then equation becomes:

y = a + bx + cx²

Still regression, but a curve instead of a line.

6️⃣ Logistic Regression (Classification but in Regression Unit)

Despite the name "regression", it is used for classification.

Used for:

  • Yes / No
  • Pass / Fail
  • Spam / Not Spam

Output range: 0 to 1

Uses sigmoid function:

σ(z) = 1 / (1 + e⁻ᶻ)

Graph looks like S-curve.

7️⃣ Ordinary Least Squares (OLS)

Core idea:

Find line that minimizes squared error.

Error = actual - predicted

OLS minimizes:

Σ(y - ŷ)²

Why square?

  • Avoid negative cancellation
  • Penalize large errors more
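For a single feature, the OLS solution has a closed form; a sketch on made-up points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])  # hypothetical data

# Closed-form OLS estimates for y = m*x + c
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
c = y.mean() - m * x.mean()

# The quantity OLS minimizes: the sum of squared errors
sse = ((y - (m * x + c)) ** 2).sum()
```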

8️⃣ Error Metrics (VERY IMPORTANT FOR EXAM)

🔹 MAE (Mean Absolute Error)

MAE = (1/n) Σ |y - ŷ|

Simple average of absolute errors.

🔹 MSE (Mean Squared Error)

MSE = (1/n) Σ(y - ŷ)²

Punishes large errors more.

🔹 RMSE

RMSE = √MSE

Brings error back to original unit.

🔹 R² (Coefficient of Determination)

Tells how much variance model explains.

Range:

  • 0 → useless
  • 1 → perfect

Formula:

R² = 1 - (SS_res / SS_tot)
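All four metrics computed by hand in NumPy on toy predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])  # hypothetical predictions

err = y_true - y_pred
mae = np.abs(err).mean()     # MAE: average absolute error
mse = (err ** 2).mean()      # MSE: punishes large errors more
rmse = np.sqrt(mse)          # RMSE: back in the original unit
ss_res = (err ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot     # R²: fraction of variance explained
```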

9️⃣ Correlation vs Regression (Important Difference)

Correlation:

  • Measures relationship strength

Regression:

  • Predicts value

Correlation ≠ Causation

🔵 PHASE 3: SUPERVISED LEARNING – CLASSIFICATION

This phase decides:

  • Fraud detection
  • Disease prediction
  • Spam detection
  • Churn prediction
  • Placement prediction

Classification = Predicting categories.

1️⃣ What is Classification?

Classification predicts discrete labels.

Examples:

  • Spam / Not Spam
  • Pass / Fail
  • Yes / No
  • Cat / Dog

Unlike regression:

  • Regression → continuous output
  • Classification → categorical output

2️⃣ Types of Classification

🔹 Binary Classification

Two classes:

  • 0 or 1
  • Yes or No

🔹 Multi-class Classification

More than two classes:

  • Grade A, B, C
  • Type 1, 2, 3

Now we move to algorithms in your syllabus.

3️⃣ K-Nearest Neighbors (KNN)

One of the simplest algorithms.

Core Idea:

To classify a new data point:

  • Look at K nearest data points
  • Take majority vote

How KNN Works

  1. Choose K (e.g., 3 or 5)
  2. Calculate distance (usually Euclidean)
  3. Pick nearest K points
  4. Majority class wins

Euclidean Distance Formula

d = √((x₁ - x₂)² + (y₁ - y₂)²)

Important:

  • Small K → high variance
  • Large K → high bias
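A minimal KNN sketch with scikit-learn on two well-separated toy groups (assuming `sklearn` is available):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two small, well-separated groups (toy data)
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# The 3 nearest neighbours of (2, 2) all belong to class 0
label = knn.predict(np.array([[2, 2]]))[0]
```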

4️⃣ Naive Bayes

Based on Bayes Theorem.

P(A|B) = P(B|A)P(A) / P(B)

Used heavily in:

  • Spam filtering
  • Text classification

Why "Naive"?

Because it assumes features are independent.

Which is rarely true in real life.

But works surprisingly well.
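A back-of-envelope Bayes theorem computation for the spam case (all probabilities here are hypothetical):

```python
# P(A): prior probability an email is spam (hypothetical)
p_spam = 0.2
# P(B|A): probability a trigger word appears in spam vs non-spam (hypothetical)
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# P(B) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
```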

5️⃣ Decision Tree

One of the most intuitive models.

It splits data based on conditions.

Example:

If age > 25 → then check salary → then classify

How It Decides Splits

Uses:

  • Gini Index
  • Entropy
  • Information Gain
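Gini and entropy can be computed directly; a sketch:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

def entropy(labels):
    # Shannon entropy: -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

pure = [1, 1, 1, 1]   # one class only -> impurity 0
mixed = [0, 1, 0, 1]  # 50/50 split -> maximum impurity
```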

Problem:

Decision trees can overfit easily.

6️⃣ Support Vector Machine (SVM)

One of the most powerful classifiers.

Core idea:

Find a hyperplane that separates classes with maximum margin.

Key Idea:

  • Not just separating line,
  • But the line with maximum distance from closest points.

Closest points = Support Vectors.

Kernel Trick (Advanced Concept)

Used when data is not linearly separable.

Common kernels:

  • Linear
  • Polynomial
  • RBF

7️⃣ Confusion Matrix (VERY IMPORTANT)

After building classifier, we evaluate it.

Confusion matrix looks like this:

              Predicted Yes    Predicted No
Actual Yes    TP               FN
Actual No     FP               TN

8️⃣ Evaluation Metrics

🔹 Accuracy

Accuracy = (TP + TN) / Total

But accuracy can be misleading.

Example:

If 95% data is negative,

Always predicting negative gives 95% accuracy.

But model is useless.

🔹 Precision

Precision = TP / (TP + FP)

How many predicted positives are correct?

Important in:

  • Spam detection

🔹 Recall

Recall = TP / (TP + FN)

How many actual positives did we detect?

Important in:

  • Disease detection

🔹 F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Balances precision & recall.

🔹 AUC & ROC Curve

ROC curve plots:

  • True Positive Rate
  • False Positive Rate

AUC measures area under curve.

Higher AUC → better model.
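Accuracy, precision, recall, and F1 computed from a hypothetical confusion matrix:

```python
# Hypothetical counts: TP, FN, FP, TN
tp, fn, fp, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)
```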

🔵 PHASE 4: UNSUPERVISED LEARNING

This is where:

  • 👉 There are no labels
  • 👉 No correct answers given
  • 👉 Model must discover patterns on its own

1️⃣ What is Unsupervised Learning?

In supervised learning:

Data = Input + Output (labels)

In unsupervised learning:

Data = Only Input

Model must find:

  • Patterns
  • Groups
  • Hidden structure

Real Life Examples

  • Customer segmentation
  • Market basket analysis
  • Fraud anomaly detection
  • Image grouping

2️⃣ Clustering

Clustering = grouping similar data points together.

Goal:

  • Maximize similarity within cluster
  • Minimize similarity between clusters

3️⃣ K-Means Clustering

One of the most popular clustering algorithms.

Core Idea:

  1. Choose K (number of clusters)
  2. Randomly initialize K centroids
  3. Assign each point to nearest centroid
  4. Update centroid as mean of cluster
  5. Repeat until stable

How Distance is Measured?

Usually Euclidean distance:

d = √((x₁ - x₂)² + (y₁ - y₂)²)
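The steps above, run via scikit-learn's `KMeans` on toy data with two obvious groups (assuming `sklearn` is installed):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 8.0], [8.0, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_               # cluster assignment per point
centroids = km.cluster_centers_   # final centroid positions
```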

4️⃣ Choosing the Value of K (VERY IMPORTANT)

Choosing wrong K = wrong clustering.

🔹 Elbow Method

Plot:

  • X-axis → number of clusters (K)
  • Y-axis → WCSS (Within Cluster Sum of Squares)

Look for elbow point.

5️⃣ Limitations of K-Means

  • Sensitive to initial centroid
  • Assumes spherical clusters
  • Struggles with different densities
  • Must choose K manually

6️⃣ Hierarchical Clustering

Instead of choosing K initially, it builds a hierarchy of clusters.

Two types:

🔹 Agglomerative (Bottom-up)

  • Start: Each point is its own cluster.
  • Then: Merge closest clusters step by step.

🔹 Divisive (Top-down)

  • Start: All data in one cluster.
  • Then: Split recursively.

Linkage Methods

How clusters are merged:

  • Single linkage (min distance)
  • Complete linkage (max distance)
  • Average linkage
  • Ward's method

Each affects shape of clusters.

7️⃣ Association Rule Mining

Used in: Market Basket Analysis.

Example:

Customers who buy bread also buy butter.

🔹 Important Terms

  • Support
  • Confidence
  • Lift

Support

How frequently an item appears.

Support(A) = Transactions containing A / Total Transactions

Confidence

Probability of buying B given A.

Confidence(A → B) = Support(A,B) / Support(A)

Lift

Measures strength of rule.

If Lift > 1 → positive association.
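Support, confidence, and lift computed on a tiny hypothetical basket dataset:

```python
# Each set is one transaction (hypothetical baskets)
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "butter"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions containing all the given items
    return sum(items <= t for t in transactions) / n

conf = support({"bread", "butter"}) / support({"bread"})  # Confidence(bread -> butter)
lift = conf / support({"butter"})                         # > 1 -> positive association
```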

8️⃣ Apriori Algorithm (Basic Idea)

Used to find frequent itemsets.

Steps:

  1. Find frequent single items
  2. Extend to pairs
  3. Remove low-support sets
  4. Repeat

🔵 PHASE 5: DIMENSIONALITY REDUCTION & NEURAL NETWORKS

This phase is where mathematics + optimization + AI meet.

🔹 PART 1: DIMENSIONALITY REDUCTION

1️⃣ Why Dimensionality Reduction?

Imagine dataset with:

  • 100 features
  • 1000 features
  • 10,000 features

Problems:

  • Computationally expensive
  • Hard to visualize
  • Risk of overfitting
  • Curse of dimensionality

More dimensions ≠ better model.

Sometimes: More dimensions = more noise.

2️⃣ Curse of Dimensionality

As dimensions increase:

  • Data points become sparse
  • Distance metrics lose meaning
  • Model becomes unstable

This is why reducing dimensions helps.

3️⃣ Principal Component Analysis (PCA)

Most important dimensionality reduction technique.

Core Idea:

Convert correlated variables into fewer uncorrelated variables.

These new variables are called: Principal Components.

What PCA Does

Instead of original features:

X1, X2, X3, X4…

PCA creates:

PC1, PC2, PC3…

Where:

  • PC1 captures maximum variance
  • PC2 captures next maximum variance
  • And so on.

4️⃣ How PCA Works (Conceptual)

  1. Standardize data
  2. Compute covariance matrix
  3. Compute eigenvalues & eigenvectors
  4. Select top components
  5. Transform data

You don't need to derive eigenvalues in the exam. Understand the logic.

5️⃣ Explained Variance

Each principal component explains some variance.

Example:

  • PC1 → 60%
  • PC2 → 25%

Together → 85%

So you can reduce 10 features to 2 features while retaining 85% information.
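A PCA sketch with scikit-learn showing explained variance on two strongly correlated synthetic features (data generated for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # almost a copy of x1
X = np.column_stack([x1, x2])

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_  # variance share of PC1, PC2
# Because x2 is nearly a multiple of x1, PC1 captures almost everything
```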

🔵 PART 2: NEURAL NETWORKS

Now we enter AI territory.

6️⃣ What is a Neural Network?

Inspired by human brain.

Basic unit = Neuron.

Each neuron:

  • Takes input
  • Applies weights
  • Adds bias
  • Applies activation function

Mathematical form:

Output = Activation(w₁x₁ + w₂x₂ + b)
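That formula as a one-neuron forward pass in NumPy (weights and bias are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0])   # inputs
w = np.array([0.8, 0.2])    # weights (arbitrary)
b = 0.1                     # bias

# Output = Activation(w1*x1 + w2*x2 + b)
output = sigmoid(w @ x + b)
```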

7️⃣ Feedforward Neural Network (FNN)

Data flows:

Input → Hidden Layer → Output

No cycles.

Used for:

  • Classification
  • Regression

8️⃣ Multi-Layer Perceptron (MLP)

An MLP has:

  • Input layer
  • Multiple hidden layers
  • Output layer

This is what people usually mean by "Neural Network".

9️⃣ Activation Functions

Activation decides output behavior.

Common ones:

  • Sigmoid
  • ReLU
  • Tanh

Sigmoid

  • Output between 0 and 1
  • Used in binary classification.

ReLU

f(x) = max(0, x)

  • Most commonly used.
  • Fast & efficient.
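The three activations side by side in NumPy:

```python
import numpy as np

z = np.array([-2.0, 0.0, 3.0])

sigmoid_out = 1 / (1 + np.exp(-z))  # squashes to (0, 1)
relu_out = np.maximum(0, z)         # f(x) = max(0, x)
tanh_out = np.tanh(z)               # squashes to (-1, 1)
```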

🔟 Convolutional Neural Network (CNN)

Used in:

  • Image recognition
  • Computer vision

Instead of normal neurons, CNN uses filters to detect patterns.

Example:

  • Edge detection
  • Texture
  • Shapes

1️⃣1️⃣ Recurrent Neural Network (RNN)

Used for sequential data:

  • Text
  • Speech
  • Time series

RNN remembers previous information.

Unlike feedforward networks, it has memory.

🔵 Why Neural Networks Matter in Predictive Analytics

Because:

Some patterns are too complex for:

  • Linear regression
  • Decision trees

Neural networks handle:

  • Non-linear relationships
  • High-dimensional data
  • Image & text data

🔵 PHASE 6: MODEL PERFORMANCE & ENSEMBLE METHODS

This phase separates beginners from serious data professionals.

🔹 PART 1: Bias–Variance Tradeoff

This is the heart of machine learning.

1️⃣ What is Bias?

Bias = error due to wrong assumptions.

Example:

Fitting a straight line to curved data.

Model is too simple.

Result:

👉 Underfitting

2️⃣ What is Variance?

Variance = error due to model being too sensitive.

Example:

Model memorizes training data exactly.

Result:

👉 Overfitting

3️⃣ Underfitting vs Overfitting

Underfitting:

  • High bias
  • Low training accuracy
  • Low testing accuracy

Overfitting:

  • Low training error
  • High testing error

Goal:

👉 Balance both

🔹 PART 2: Cross Validation

Training accuracy alone is dangerous.

We need to test model reliability.

4️⃣ Train-Test Split

Basic method:

Split data:

  • 70% training
  • 30% testing

But problem:

Result depends on random split.

5️⃣ K-Fold Cross Validation (VERY IMPORTANT)

Process:

  1. Divide data into K equal parts
  2. Train on K-1 parts
  3. Test on remaining part
  4. Repeat K times
  5. Average performance
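The five steps above in one call with scikit-learn, using its built-in iris dataset (assuming `sklearn` is installed):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, test on the 5th, repeat, then average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()
```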

6️⃣ Leave-One-Out (LOO)

Extreme case of K-fold:

K = number of samples

Very expensive computationally.

🔹 PART 3: Ensemble Methods

Ensemble = combining multiple models.

Core Idea:

Many weak models together become strong.

7️⃣ Bagging (Bootstrap Aggregating)

Example: Random Forest.

Process:

  1. Create multiple datasets by sampling with replacement
  2. Train model on each dataset
  3. Combine predictions (average or voting)

Reduces:

👉 Variance

8️⃣ Random Forest

Most popular ensemble algorithm.

It is:

👉 Many decision trees combined.

Each tree:

  • Trained on different data sample
  • Uses random subset of features

Final output:

  • Majority vote (classification)
  • Average (regression)

Why powerful?

  • Reduces overfitting
  • Handles high-dimensional data well
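A Random Forest sketch with scikit-learn on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 trees, each trained on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)  # majority-vote accuracy on held-out data
```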

9️⃣ Boosting

Unlike bagging:

  • Bagging → models independent
  • Boosting → models sequential

Each new model:

Focuses on correcting previous errors.

Popular Boosting Methods:

  • AdaBoost
  • Gradient Boosting
  • XGBoost

🔟 Bagging vs Boosting

Bagging                     Boosting
Reduces variance            Reduces bias
Parallel training           Sequential training
Example: Random Forest      Example: AdaBoost
