Spam Detection: Machine Learning

Hook: The First Battle of NLP

Every day, many millions of spam messages are filtered out. Spam detection is the classic text classification project and a natural starting point for building NLP pipelines.

Problem & Dataset

Goal: classify an SMS/email as spam (1) or ham (0).
Dataset: UCI SMS Spam Collection (5,572 messages).
Metric: Precision (letting spam through as ham is the less costly mistake), F1, ROC-AUC.
Class imbalance: ~13% spam → a stratified split is needed.
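A stratified split keeps the spam share the same in both partitions. The following is a minimal sketch using scikit-learn's train_test_split; the synthetic messages, the 870/130 counts, and the 80/20 ratio are illustrative assumptions, not the real dataset.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real corpus: 13% spam, mirroring the imbalance noted above.
texts = [f"ham message {i}" for i in range(870)] + [f"spam offer {i}" for i in range(130)]
labels = [0] * 870 + [1] * 130  # 0 = ham, 1 = spam

# stratify=labels preserves the class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

train_spam_rate = sum(y_train) / len(y_train)
test_spam_rate = sum(y_test) / len(y_test)
print(f"train spam rate: {train_spam_rate:.3f}, test spam rate: {test_spam_rate:.3f}")
```

Without stratify, a random 20% test set could end up with noticeably more or less spam than the training set, which makes the reported precision and recall harder to trust.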

Text Preprocessing

clean.py

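import re
import string


def clean(text: str) -> str:
    """Normalize one message: lowercase, mask URLs and numbers, strip punctuation."""
    text = text.lower()
    text = re.sub(r"http\S+", " URL ", text)  # mask links before punctuation is stripped
    text = re.sub(r"\d+", " NUM ", text)      # mask digit runs (prices, phone numbers)
    # string.punctuation covers ASCII punctuation only.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())             # collapse repeated whitespace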

Lowercase, then normalize URLs and numbers to placeholder tokens.
Strip punctuation and collapse whitespace.
Stopwords can be kept; in short texts they often help.
Stemming/lemmatization is optional.
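To see what these steps do in practice, here is the clean function from clean.py applied to a made-up SMS. The sample text is an assumption chosen for illustration.

```python
import re
import string


def clean(text: str) -> str:
    # Same steps as clean.py: lowercase, mask URLs and digits, strip punctuation.
    text = text.lower()
    text = re.sub(r"http\S+", " URL ", text)
    text = re.sub(r"\d+", " NUM ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


sample = "WINNER!! Claim 1000 now at http://bit.ly/x1, call 0800-123"
cleaned = clean(sample)
print(cleaned)  # winner claim NUM now at URL call NUM NUM
```

Two details are worth noticing: the URL pattern swallows the trailing comma because it matches any run of non-whitespace, and the hyphenated phone number becomes two NUM tokens because the hyphen separates the digit runs.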

Baseline — TF-IDF + Logistic Regression

baseline.py

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

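# X_train/X_test are lists of message strings and y_train/y_test their 0/1 labels,
# taken from a stratified split.
pipe = Pipeline([
    # Word unigrams and bigrams; drop terms that appear in fewer than 2 messages.
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=20000)),
    # class_weight="balanced" upweights the minority spam class.
    ("clf",   LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test), target_names=["ham", "spam"]))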

This baseline alone typically reaches ~98% accuracy. Treat it as the benchmark before bringing in anything more complex.
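Accuracy needs a reference point on imbalanced data. As a rough sketch, a hypothetical classifier that labels every message as ham already scores about 87% accuracy on a set with 13% spam, while catching no spam at all. The 87/13 counts and the helper names below are illustrative.

```python
def accuracy(y_true, y_pred):
    # Fraction of messages whose predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def spam_recall(y_true, y_pred):
    # Fraction of true spam messages that were actually flagged as spam.
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return caught / sum(y_true)


y_true = [0] * 87 + [1] * 13       # 0 = ham, 1 = spam
always_ham = [0] * len(y_true)     # trivial "model" that never flags anything

majority_accuracy = accuracy(y_true, always_ham)
majority_recall = spam_recall(y_true, always_ham)
print(majority_accuracy, majority_recall)  # 0.87 0.0
```

So the useful gain of the baseline is the gap between its accuracy and 87%, together with its spam precision and recall.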

Upgrade — Transformer Fine-tune

hf_finetune.py

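from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import datasets

# The dataset ships with a single "train" split, so hold out 20% for evaluation.
ds = datasets.load_dataset("sms_spam")["train"].train_test_split(
    test_size=0.2, seed=42, stratify_by_column="label")
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def enc(b):
    # Pad/truncate every SMS to 64 tokens so batches have a fixed shape.
    return tok(b["sms"], truncation=True, padding="max_length", max_length=64)

ds = ds.map(enc, batched=True).rename_column("label", "labels")

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# Recent transformers releases call this argument eval_strategy; older ones use evaluation_strategy.
args = TrainingArguments("out", num_train_epochs=3, per_device_train_batch_size=32,
                         eval_strategy="epoch", learning_rate=2e-5)
# Evaluate on the held-out split, not on the training data.
Trainer(model=model, args=args, train_dataset=ds["train"], eval_dataset=ds["test"]).train()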
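Without a metrics callback, Trainer's evaluation reports the loss but no classification metrics. Below is a hedged sketch of a compute_metrics function that adds spam precision, recall, and F1. The returned key names are choices made here, and the function would be passed as Trainer(..., compute_metrics=compute_metrics).

```python
import numpy as np


def compute_metrics(eval_pred):
    # Trainer passes a (logits, labels) pair; logits have shape (n_samples, 2).
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))  # spam correctly flagged
    fp = int(np.sum((preds == 1) & (labels == 0)))  # ham wrongly flagged
    fn = int(np.sum((preds == 0) & (labels == 1)))  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Quick check with hand-made logits: predictions come out as [0, 1, 1, 0].
fake_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]])
fake_labels = np.array([0, 1, 0, 1])
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

Computing the metrics for the spam class only keeps the focus on the error types that matter here, rather than an average that the large ham class would dominate.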

Evaluation Choices

Confusion matrix: false positives cost more (a legitimate message gets blocked).
Prefer the PR curve over ROC when the classes are imbalanced.
Threshold tuning: choose the cutoff from business cost instead of defaulting to 0.5.
Explainability: use LIME/SHAP to see which words act as spam signals.
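One way to turn the threshold advice into code is to fix a minimum precision and take the cutoff that recovers the most spam under that constraint. This is a sketch built on scikit-learn's precision_recall_curve; the helper name pick_threshold, the 0.99 default, and the toy validation scores are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, spam_scores, min_precision=0.99):
    """Return the cutoff with the highest recall whose precision meets min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, spam_scores)
    # precision/recall carry one extra trailing point (1.0, 0.0) that has no threshold.
    precision, recall = precision[:-1], recall[:-1]
    meets_target = precision >= min_precision
    if not meets_target.any():
        return None  # no cutoff reaches the requested precision
    best = np.argmax(np.where(meets_target, recall, -1.0))
    return float(thresholds[best])


# Toy validation set: true labels and predicted spam probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
val_scores = [0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.8, 0.9]
threshold = pick_threshold(y_val, val_scores, min_precision=1.0)
print(threshold)  # 0.8
```

In this toy set, a cutoff of 0.45 would catch all three spam messages but also block the ham message scored 0.6. Requiring perfect precision moves the cutoff to 0.8 at the price of missing one spam message.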

Deploy — FastAPI + Streamlit

api.py

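from fastapi import FastAPI
from pydantic import BaseModel
import joblib

# Load the fitted pipeline once at startup, not on every request.
pipe = joblib.load("spam_pipe.pkl")
# Decision cutoff; 0.5 is only a placeholder until it is tuned on validation data.
THRESHOLD = 0.5
app = FastAPI()

class Msg(BaseModel):
    text: str

@app.post("/predict")
def predict(m: Msg):
    # predict_proba returns [[P(ham), P(spam)]]; column 1 is the spam probability.
    p = pipe.predict_proba([m.text])[0, 1]
    return {"spam_probability": float(p), "is_spam": bool(p > THRESHOLD)}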
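The API expects a file named spam_pipe.pkl. One way it might be produced is to fit the vectorizer and classifier as a single Pipeline and save that object, so inference runs exactly the same text processing as training. This is a sketch with a toy corpus; the messages and the temporary file location are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real training split (0 = ham, 1 = spam).
train_texts = [
    "are we still meeting for lunch today",
    "call me when you get home",
    "see you at the station at six",
    "free prize claim your reward now",
    "win cash now text claim to this number",
    "urgent claim your free prize today",
]
train_labels = [0, 0, 0, 1, 1, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipe.fit(train_texts, train_labels)

# Save vectorizer and classifier together, then reload as the API would.
path = os.path.join(tempfile.mkdtemp(), "spam_pipe.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

probe = ["claim your free prize now"]
original_proba = pipe.predict_proba(probe)
restored_proba = restored.predict_proba(probe)
print(restored_proba)
```

If a custom function such as clean is passed to the vectorizer through preprocessor=, pickle stores it by reference, so the function has to live in a module that both the training script and the API process can import.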

Lessons

Baseline first, Transformer later.
Ignore class imbalance and the metrics will lie.
Tune the threshold to the product.
Use the same preprocessing at training and inference: pickle the whole pipeline.

Summary

At a glance

Spam detection is the cleanest introduction to text classification. Start with TF-IDF + LR and move to DistilBERT if needed. Threshold tuning and explainability both matter in production.