Ensemble methods and advanced topics (boosting, bagging) | Machine Learning Tutorial - Learn with VOKS
Ensemble methods and advanced topics (boosting, bagging)


1️⃣ What is Ensemble Learning?

Ensemble Learning means:

Combining multiple models to create a stronger model.

Instead of relying on one model, we combine many.

Think of it like:

  • One doctor → opinion
  • 10 doctors → better diagnosis

2️⃣ Why Ensembles Work

Individual models may:

  • Overfit
  • Underfit
  • Make random errors

But combining models:

  • Reduces variance
  • Reduces bias
  • Improves generalization

3️⃣ Types of Ensemble Methods


| Method | Idea |
|--------|------|
| Bagging | Train models independently in parallel |
| Boosting | Train models sequentially, correcting errors |
| Stacking | Combine predictions using another model |
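Stacking does not get its own section below, so here is a minimal sketch using scikit-learn's `StackingClassifier` (scikit-learn 1.2+ assumed; the dataset and split mirror the examples later in this tutorial). The choice of base models and meta-model here is illustrative, not prescriptive:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Base models predict; a final "meta" model learns how to combine their predictions
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
                ("svm", SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))
```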

4️⃣ Bagging (Bootstrap Aggregating)

Bagging works like this:

  1. Create multiple random subsets of the data (sampling with replacement)
  2. Train a model on each subset
  3. Combine the predictions: average them (regression) or take a majority vote (classification)

Goal:

Reduce variance.

🔹 Why “Bootstrap”?

Because we sample with replacement: some data points appear multiple times in a subset, and some not at all.
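A quick NumPy sketch makes this concrete (the ten-point dataset is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # ten data points, labelled 0..9

# Bootstrap sample: same size as the data, drawn WITH replacement
sample = rng.choice(data, size=len(data), replace=True)

print("Bootstrap sample:", sample)
print("Appear more than once:", sorted({x for x in sample if (sample == x).sum() > 1}))
print("Missing entirely:    ", sorted(set(data) - set(sample)))
```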

5️⃣ Random Forest (Most Popular Bagging Method)

Random Forest = Bagging + Decision Trees.

Instead of one tree:

  • Build many decision trees
  • Each tree sees a random subset of the data (a bootstrap sample)
  • Each split considers a random subset of the features
  • Final output = average (regression) or majority vote (classification)

Random Forest reduces overfitting compared to a single tree.


6️⃣ Boosting

Boosting works differently.

Instead of independent models:

Models are trained sequentially.

Each new model:

  • Focuses on correcting the previous model’s mistakes.

Goal:

Reduce bias.

7️⃣ AdaBoost (Adaptive Boosting)

AdaBoost:

  1. Train a weak learner (usually a small decision tree)
  2. Increase the weights of misclassified points
  3. Train the next learner, focusing on those errors
  4. Combine all learners with weighted voting

It adapts to mistakes.


8️⃣ Gradient Boosting

Gradient Boosting:

  • Optimizes a loss function directly
  • Each new model fits the residual errors of the ensemble so far
  • Uses the idea of gradient descent, applied to the model’s predictions

Very powerful.
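To see "each new model fits the residual errors" in action, here is a hand-rolled boosting loop for regression. The sine dataset is a toy invented for illustration, and this sketch omits the full gradient machinery (for squared-error loss, the negative gradient is exactly the residual):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)

# Start from a constant prediction (the mean), then repeatedly
# fit a small tree to the current residuals and add a fraction of it
learning_rate = 0.1
prediction = np.full_like(y, y.mean())
for _ in range(100):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```

Each round shrinks the residuals a little; the `learning_rate` controls how aggressively, which is why small rates need more rounds.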

Popular implementations:

  • XGBoost
  • LightGBM
  • CatBoost

Used heavily in:

  • Kaggle competitions
  • Industry ML systems

9️⃣ Bagging vs Boosting

| Feature | Bagging | Boosting |
|----------|----------|----------|
| Training | Parallel | Sequential |
| Goal | Reduce variance | Reduce bias |
| Overfitting | Less prone | Can overfit |
| Example | Random Forest | Gradient Boosting |

🔟 Applications of Ensemble Methods

| Application | Example |
|-------------|----------|
| Fraud Detection | Banking systems |
| Medical Diagnosis | Disease prediction |
| Recommendation Systems | User ranking |
| Credit Scoring | Loan approval |
| Competitions | Kaggle winning models |

1️⃣1️⃣ Python Example — Random Forest (Bagging)

🔹 Install Libraries

pip install numpy pandas scikit-learn

🔹 Random Forest Classification

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data
data = load_iris()
X = data.data
y = data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict
predictions = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, predictions))

1️⃣2️⃣ Python Example — Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)

predictions_gb = gb.predict(X_test)

print("Gradient Boosting Accuracy:", accuracy_score(y_test, predictions_gb))

1️⃣3️⃣ What You Should Understand

After this topic, you should know:

  • Why combining models improves performance
  • Difference between bagging and boosting
  • How Random Forest works
  • How Gradient Boosting works
  • When to use ensemble methods

FULL COMPILATION OF ALL CODE


Example Code:
# Install:
# pip install numpy pandas scikit-learn

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data
data = load_iris()
X = data.data
y = data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# -------------------------
# RANDOM FOREST
# -------------------------
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

rf_predictions = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_predictions))

# -------------------------
# GRADIENT BOOSTING
# -------------------------
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)

gb_predictions = gb.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_predictions))