Random Forest — A Forest of Trees — Machine Learning

Hook — One Expert vs. Fifty

A single doctor can make a mistake. But if 50 doctors examine the same case independently and you take the majority opinion, the probability of error is much lower, as long as they do not all make the same mistake. Random Forest uses exactly this power of the majority vote.

Concept — Bagging + Random Features

A Random Forest is a collection of many Decision Trees, each trained on a random subset of the data (a bootstrap sample) and a random subset of the features. The final prediction is the majority vote (classification) or the average (regression).

Bootstrap: random sampling with replacement.
Feature subset: a random subset of features is considered at each split.
Aggregation: the outputs of all trees are combined.
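The three ingredients above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements its forest internally: the dataset, the value `n_trees = 25`, and names such as `trees` and `votes` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative value; odd, so a binary vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Feature subset: max_features="sqrt" makes the tree consider only a
    # random subset of the features at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregation: majority vote over the 0/1 predictions of all trees.
votes = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
manual_forest_acc = (forest_pred == y_test).mean()
print("Manual forest accuracy:", manual_forest_acc)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.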

Variance reduction

Averaging many trees reduces variance, so a forest is much more stable than a single tree.
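One rough way to see this stability is to refit a single tree and a forest on several resampled versions of the training set and compare how much the test accuracy moves. The sketch below assumes the breast cancer dataset, 10 resamples, and 50 trees per forest, all picked only for illustration. The forest's spread would typically be smaller, but that is not guaranteed on any single run.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
tree_scores, forest_scores = [], []
for run in range(10):
    # Resample the training rows to mimic receiving a different dataset.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    Xb, yb = X_train[idx], y_train[idx]
    tree = DecisionTreeClassifier(random_state=run).fit(Xb, yb)
    forest = RandomForestClassifier(n_estimators=50, random_state=run).fit(Xb, yb)
    # Both models are scored on the same fixed test set.
    tree_scores.append(tree.score(X_test, y_test))
    forest_scores.append(forest.score(X_test, y_test))

print("single tree: mean=%.3f std=%.3f" % (np.mean(tree_scores), np.std(tree_scores)))
print("forest:      mean=%.3f std=%.3f" % (np.mean(forest_scores), np.std(forest_scores)))
```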

Math — Why It Works

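If each tree's prediction has variance σ² and the trees are independent, the variance of the average of B trees drops to σ²/B: the bias stays roughly the same while the variance falls. Real trees are trained on overlapping data, so they are correlated, and the variance of the average becomes: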

Var(avg) ≈ ρσ² + (1−ρ)σ²/B

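Here ρ is the correlation between trees. As B grows, the second term vanishes but the first term ρσ² remains, so adding trees alone cannot push the variance below it. Feature subsampling lowers ρ, which is why it makes the forest stronger.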
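The formula can be checked numerically. The sketch below is an illustration under stated assumptions: the "tree predictions" are simulated as Gaussian variables with equal variance σ² and equal pairwise correlation ρ, and the values ρ = 0.3, σ² = 1, B = 10 as well as the helper name `ensemble_variance` are hypothetical.

```python
import numpy as np

def ensemble_variance(rho, sigma2, B):
    """Variance of the average of B estimators with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

rho, sigma2, B = 0.3, 1.0, 10  # illustrative values
rng = np.random.default_rng(0)
n_trials = 100_000

# A shared component creates correlation rho between the B simulated trees;
# each row is one trial, each column one tree.
shared = rng.normal(size=(n_trials, 1))
noise = rng.normal(size=(n_trials, B))
preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise)

empirical = preds.mean(axis=1).var()
theoretical = ensemble_variance(rho, sigma2, B)
print("empirical:", round(empirical, 4), "formula:", round(theoretical, 4))

# With more trees the variance approaches the floor rho * sigma2.
for b in (1, 10, 100, 1000):
    print(b, round(ensemble_variance(rho, sigma2, b), 4))
```

With these values the formula gives 0.37 for B = 10 and approaches the floor ρσ² = 0.3 as B grows, which is the part of the variance that only a lower ρ can remove.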

Real-world Use

The industry-default baseline for tabular data.
Extracting feature importance.
Fraud detection.
Medical risk scoring.

Code — Sklearn Random Forest

rf_demo.py

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the data and hold out 20% of the rows for testing.
data = load_breast_cancer()
X, y = data.data, data.target
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

# 300 fully grown trees (max_depth=None), at least 2 samples per leaf,
# trained in parallel on all available cores (n_jobs=-1).
model = RandomForestClassifier(
    n_estimators=300, max_depth=None,
    min_samples_leaf=2, n_jobs=-1, random_state=42
).fit(Xtr, ytr)

print("Test acc:", model.score(Xte, yte))

# Impurity-based feature importances, labelled with the feature names.
imp = pd.Series(model.feature_importances_, index=data.feature_names)
print(imp.sort_values(ascending=False).head(10))

Common Mistakes

Setting n_estimators too low: the predictions keep a high variance.
Leaving max_depth=None on a large dataset, so fully grown trees blow up memory.
Treating feature importance as causation.
Ignoring the OOB score during evaluation.
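For the last point, a minimal sketch of reading the out-of-bag (OOB) score in scikit-learn follows. Each tree is scored on the training rows left out of its bootstrap sample, so the estimate needs no separate validation set. The dataset and hyperparameters are assumptions carried over from the demo for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# oob_score=True requires bootstrap sampling, which is the default.
model = RandomForestClassifier(
    n_estimators=300, oob_score=True, random_state=42
)
model.fit(X_train, y_train)

oob_acc = model.oob_score_  # accuracy estimated from out-of-bag rows only
test_acc = model.score(X_test, y_test)
print("OOB accuracy: ", oob_acc)
print("Test accuracy:", test_acc)
```

If the two numbers are close, the OOB score is acting as a reasonable stand-in for held-out accuracy on this data. A large gap would be worth investigating.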

Practice Tasks

Task 1: Compare test accuracy and training time for n_estimators = 10, 50, 100, 500.
Task 2: Set oob_score=True and check the out-of-bag accuracy.
Task 3: Compare the results with ExtraTreesClassifier.

Mini Project — Telco Churn Predictor

Train a Random Forest on the Kaggle Telco Customer Churn dataset. Extract the top 10 feature importances and write a summary in Bengali for business stakeholders.

Summary

At a glance

Random Forest = many Decision Trees + Bagging + Random features = a robust, low-variance powerhouse.