Naive Bayes: Classification with Probability (Machine Learning)

Hook: Catching Spam by Counting the Words in an Email

If words such as “Lottery”, “Free”, and “Win” appear often, the email is probably spam. Each such word independently raises the probability. Naive Bayes is the math behind this “count the words, get a probability” idea.

Concept: Bayes' Theorem

P(class | features) = P(features | class) · P(class) / P(features)

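It is called “naive” because we assume that all features are independent of one another given the class, so P(features | class) factorizes into the product of the individual P(xᵢ | class). In practice this assumption often does not hold, yet the classifier frequently works well anyway.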
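To make the formula concrete, here is a minimal sketch that applies Bayes' theorem with the independence assumption to a two-word message. The probabilities are made-up illustrative numbers, not estimates from any real dataset, and the function name `naive_bayes_posterior` is hypothetical.

```python
from math import prod

# Hypothetical numbers, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}            # P(class)
likelihoods = {                               # P(word | class)
    "spam": {"free": 0.30, "win": 0.20},
    "ham": {"free": 0.05, "win": 0.01},
}

def naive_bayes_posterior(words, priors, likelihoods):
    """Return P(class | words), assuming words are independent given the class."""
    # Numerator of Bayes' theorem: P(class) * product of P(word | class)
    joint = {
        c: priors[c] * prod(likelihoods[c][w] for w in words)
        for c in priors
    }
    # Denominator P(features): sum of the numerators over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

posterior = naive_bayes_posterior(["free", "win"], priors, likelihoods)
print(posterior)
```

Because P(features) is the same for every class, it only rescales the scores so that they sum to 1; the predicted class is already determined by the numerators.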

Variants

Gaussian NB: continuous features (assumed normally distributed within each class).
Multinomial NB: count data (text classification).
Bernoulli NB: binary features (word present/absent).
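The sketch below pairs each variant with the kind of feature it expects, using the scikit-learn classes `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The tiny arrays are invented toy data, picked so that the two classes are trivially separable.

```python
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous measurements -> Gaussian NB
X_cont = [[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]]
y_cont = [0, 0, 0, 1, 1, 1]
gauss_pred = GaussianNB().fit(X_cont, y_cont).predict([[1.1], [5.1]])

# Word counts per document -> Multinomial NB
X_count = [[3, 0], [2, 0], [0, 3], [0, 2]]
y_count = [1, 1, 0, 0]
multi_pred = MultinomialNB(alpha=1.0).fit(X_count, y_count).predict([[2, 0], [0, 2]])

# Word present (1) / absent (0) -> Bernoulli NB
X_bin = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_bin = [1, 1, 0, 0]
bern_pred = BernoulliNB(alpha=1.0).fit(X_bin, y_bin).predict([[1, 0], [0, 1]])

print(gauss_pred, multi_pred, bern_pred)
```

All three share the same `fit`/`predict` interface; only the assumed per-feature distribution differs.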

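Math: Log-Probability and Laplace Smoothing

Multiplying the probabilities of many features produces a very small number, which can underflow to zero in floating-point arithmetic. So we work with logarithms instead, which turns the product into a sum.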

log P(c|x) ∝ log P(c) + Σ log P(xᵢ | c)
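A small sketch of the underflow problem, assuming 1000 features whose per-feature likelihoods are the invented values 0.01 for class A and 0.02 for class B. The raw products both collapse to 0.0, while the log sums remain comparable.

```python
import math

# Hypothetical per-feature likelihoods for two classes (illustration only)
probs_a = [0.01] * 1000
probs_b = [0.02] * 1000

# Raw products: far below the smallest positive float, so both become 0.0
product_a = math.prod(probs_a)
product_b = math.prod(probs_b)

# Log sums: large negative numbers, but perfectly representable
log_a = sum(math.log(p) for p in probs_a)
log_b = sum(math.log(p) for p in probs_b)

print(product_a, product_b)   # the two classes can no longer be told apart
print(log_a, log_b)           # the larger log score marks the more likely class
```

Since the logarithm is monotonic, picking the class with the largest log score gives the same decision as picking the largest product would, if the product could be computed exactly.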

Laplace Smoothing

P(xᵢ | c) = (count + α) / (total + α·|V|)

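Here count is the number of times xᵢ occurs in class c, total is the total count of all words in class c, |V| is the vocabulary size, and α is the smoothing strength (α = 1 is classic Laplace smoothing). If a word never appears in the training data for a class, its unsmoothed probability is 0, and that single zero makes the whole product 0. Smoothing prevents this.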
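The smoothing formula can be computed by hand in a few lines. This is a minimal sketch: the helper name `smoothed_word_probs`, the two spam messages, and the five-word vocabulary are all hypothetical.

```python
from collections import Counter

def smoothed_word_probs(class_docs, vocabulary, alpha=1.0):
    """Estimate P(word | class) = (count + alpha) / (total + alpha * |V|)."""
    counts = Counter(w for doc in class_docs for w in doc.lower().split())
    total = sum(counts[w] for w in vocabulary)       # all word occurrences in the class
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Toy data (assumption): "meeting" never occurs in the spam messages
spam_docs = ["win free prize", "free lottery win"]
vocabulary = ["win", "free", "prize", "lottery", "meeting"]

probs = smoothed_word_probs(spam_docs, vocabulary, alpha=1.0)
unsmoothed = smoothed_word_probs(spam_docs, vocabulary, alpha=0.0)

print(probs["meeting"])        # small but non-zero
print(unsmoothed["meeting"])   # exactly zero without smoothing
```

With α = 1 the unseen word “meeting” gets probability 1/11 instead of 0, and the smoothed probabilities still sum to 1 over the vocabulary.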

Real-world Use

Email spam filtering.
Sentiment analysis (text).
News categorization.
Medical diagnosis (probabilistic).

Code: Spam Detection

spam_nb.py

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

texts = [
    "Win a free iPhone now",
    "Lottery winner claim prize",
    "Meeting at 5pm tomorrow",
    "Project deadline reminder",
    "Free vacation click here",
    "Lunch plan with team",
]
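labels = [1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = not spam

# Hold out about a third of this tiny dataset for testing
Xtr, Xte, ytr, yte = train_test_split(texts, labels, test_size=0.33, random_state=0)

# Bag-of-words counts feed a Multinomial NB with Laplace smoothing (alpha=1.0)
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0)).fit(Xtr, ytr)
# With only two test samples, this accuracy is illustrative, not a reliable estimate
print("Test acc:", model.score(Xte, yte))
print(model.predict(["claim your free prize", "send the report by evening"]))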
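Beyond hard labels, the fitted pipeline can also report class probabilities through `predict_proba`. The sketch below is a variation on the script above, not part of it: it trains on all six messages (no train/test split) purely so the fitted model is easy to inspect.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "Win a free iPhone now",
    "Lottery winner claim prize",
    "Meeting at 5pm tomorrow",
    "Project deadline reminder",
    "Free vacation click here",
    "Lunch plan with team",
]
labels = [1, 1, 0, 0, 1, 0]   # 1 = spam

# Assumption: fit on every message, only to look inside the model
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0)).fit(texts, labels)

# Column order of predict_proba follows the classifier's classes_ attribute
classes = model.named_steps["multinomialnb"].classes_

new_messages = ["claim your free prize", "send the report by evening"]
proba = model.predict_proba(new_messages)

for msg, row in zip(new_messages, proba):
    print(f"{msg!r}: P(spam) = {row[1]:.3f}")
```

None of the words in the second message appear in the training vocabulary, so its feature vector is all zeros and the model falls back on the class priors (here 0.5 each). That is a useful reminder that a prediction on unfamiliar text carries little evidence.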

Common Mistakes

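Using Multinomial NB on continuous features.
Setting the smoothing parameter α = 0, which brings back the zero-probability problem.
Applying NB everywhere by taking independence for granted, even when the features are strongly correlated.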

Practice Tasks

Task 1: Compare the accuracy of TfidfVectorizer and CountVectorizer.
Task 2: Compare the results for α = 0.01, 0.5, 1, and 5.
Task 3: Train MultinomialNB on the 20Newsgroups dataset.

Mini Project: SMS Spam Classifier

Build a spam detector with the UCI SMS Spam Collection dataset. Report precision and recall, and test the model on a few real SMS messages.
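As a reminder of what the two metrics measure, here is a minimal sketch that computes them by hand. The helper name `precision_recall` and the label lists are hypothetical; they are not results from the SMS dataset.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, low precision means legitimate messages get flagged, while low recall means spam slips through, so it is worth reporting both rather than accuracy alone.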

Summary

At a glance

Naive Bayes = Bayes' theorem + the “independent features” assumption = a fast, simple, surprisingly strong text classifier.