Hook: NLP's First Battle
Billions of spam messages are filtered every day. Spam detection is the classic text classification project and a natural first step into NLP pipelines.
Problem & Dataset
- Goal: classify an SMS/email as spam (1) or ham (0).
- Dataset: UCI SMS Spam Collection (5,572 messages).
- Metrics: precision (letting a spam through as ham costs less than blocking a legitimate message), F1, ROC-AUC.
- Class imbalance: ~13% spam, so a stratified split is needed.
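A stratified split can be sketched as follows. This is a minimal illustration on made-up data, assuming scikit-learn's `train_test_split`; the toy `texts` and `labels` lists and the 80/20 ratio are assumptions, not part of the original project.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real corpus: 1,000 messages, 130 of them spam (13%).
texts = [f"message {i}" for i in range(1000)]
labels = [1] * 130 + [0] * 870

# stratify=labels keeps the spam share the same in both parts;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

train_rate = sum(y_train) / len(y_train)
test_rate = sum(y_test) / len(y_test)
print(f"train spam rate: {train_rate:.3f}, test spam rate: {test_rate:.3f}")
```

Without `stratify`, a random 20% sample of a small dataset can end up with noticeably more or less spam than the training part, which distorts the test metrics.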
Text Preprocessing
clean.py
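    import re
    import string

    def clean(text: str) -> str:
        """Normalize a raw message before vectorization."""
        text = text.lower()
        text = re.sub(r"http\S+", " URL ", text)  # links become a placeholder token
        text = re.sub(r"\d+", " NUM ", text)      # digit runs become a placeholder token
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())             # collapse repeated whitespace

- Lowercase, then normalize URLs and numbers to placeholder tokens.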
- Strip punctuation, collapse whitespace.
- Stopwords can be kept: in short texts they carry useful signal.
- Stemming/lemmatization is optional.
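To see what these steps do to an actual message, here is a small sketch that repeats the `clean` function from clean.py and runs it on two made-up messages (the messages and the `win.example` URL are invented for illustration).

```python
import re
import string


def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+", " URL ", text)
    text = re.sub(r"\d+", " NUM ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


spam = clean("WINNER!! Claim 1000 cash at http://win.example/now, txt 80086")
ham = clean("Are you free at 5?")
print(spam)  # winner claim NUM cash at URL txt NUM
print(ham)   # are you free at NUM
```

The URL is replaced before the digits, so numbers inside a link do not turn into extra `NUM` tokens. Stopwords such as "are" and "you" survive, in line with the note above about short texts.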
Baseline — TF-IDF + Logistic Regression
baseline.py
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import classification_report

    # X_train/X_test: raw message strings; y_train/y_test: 0/1 labels from a stratified split.
    pipe = Pipeline([
        # Unigrams + bigrams, dropping terms seen in fewer than 2 messages.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=20000)),
        # class_weight="balanced" compensates for the ~13% spam share.
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])
    pipe.fit(X_train, y_train)
    print(classification_report(y_test, pipe.predict(X_test)))

This baseline alone usually reaches ~98% accuracy. Treat it as the benchmark before bringing in anything more complex.
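The ~98% figure is easier to judge next to a trivial reference point. The sketch below, on a hypothetical 1,000-message test set with the ~13% spam rate mentioned earlier, shows what a "model" that never flags anything would score.

```python
# Hypothetical test set with a 13% spam rate: 1 = spam, 0 = ham.
y_true = [1] * 130 + [0] * 870
# A "model" that never flags anything.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_recall = caught / sum(y_true)

print(f"accuracy: {accuracy:.2f}")        # 0.87
print(f"spam recall: {spam_recall:.2f}")  # 0.00
```

Accuracy alone starts at 87% for free on this class balance, so the per-class precision and recall in `classification_report` are the numbers worth reading.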
Upgrade — Transformer Fine-tune
hf_finetune.py
    from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
    import datasets

    # sms_spam ships only a "train" split, so hold out a stratified 20% for evaluation.
    ds = datasets.load_dataset("sms_spam")["train"]
    ds = ds.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def enc(b):
        return tok(b["sms"], truncation=True, padding="max_length", max_length=64)

    ds = ds.map(enc, batched=True).rename_column("label", "labels")
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    # Older transformers releases call this argument `evaluation_strategy`.
    args = TrainingArguments("out", num_train_epochs=3, per_device_train_batch_size=32,
                             eval_strategy="epoch", learning_rate=2e-5)
    # Evaluate on the held-out split, not on the training data.
    Trainer(model=model, args=args, train_dataset=ds["train"], eval_dataset=ds["test"]).train()

Evaluation Choices
- Confusion matrix: false positives cost more (a legitimate message gets blocked).
- Prefer the PR curve over ROC when classes are imbalanced.
- Threshold tuning: do not stay at the default 0.5, set it according to business cost.
- Explainability: use LIME/SHAP to see which words act as spam signals.
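Cost-based threshold tuning can be sketched in plain Python. The helper names `confusion_counts` and `pick_threshold`, the cost values, and the validation scores below are all hypothetical; the idea is only to pick the cut-off that minimizes `fp * fp_cost + fn * fn_cost` on held-out data.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) when messages with score >= threshold are flagged as spam."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pick_threshold(
    y_true: Sequence[int], scores: Sequence[float], fp_cost: float, fn_cost: float
) -> float:
    """Return the candidate threshold with the lowest total cost (lowest threshold wins ties)."""
    # Every observed score is a candidate; inf means "never flag anything".
    candidates = sorted(set(scores)) + [float("inf")]

    def total_cost(threshold: float) -> float:
        _, fp, fn, _ = confusion_counts(y_true, scores, threshold)
        return fp * fp_cost + fn * fn_cost

    return min(candidates, key=total_cost)


# Made-up validation labels and predicted spam probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]

strict = pick_threshold(y_val, p_val, fp_cost=5, fn_cost=1)   # blocking ham is expensive
lenient = pick_threshold(y_val, p_val, fp_cost=1, fn_cost=5)  # missing spam is expensive
print(strict, lenient)                         # 0.6 0.4
print(confusion_counts(y_val, p_val, 0.5))     # (3, 1, 1, 4)
print(confusion_counts(y_val, p_val, strict))  # (3, 0, 1, 5)
```

On this toy data the default 0.5 blocks one legitimate message, while the cost-aware threshold of 0.6 blocks none and catches the same three spam messages. In a real project the threshold should be chosen on a validation split, not on the test set.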
Deploy — FastAPI + Streamlit
api.py
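    from fastapi import FastAPI
    from pydantic import BaseModel
    import joblib

    # Load the fitted pipeline (vectorizer + classifier) once, at startup.
    pipe = joblib.load("spam_pipe.pkl")
    app = FastAPI()

    class Msg(BaseModel):
        text: str

    @app.post("/predict")
    def predict(m: Msg):
        # Column 1 of predict_proba is the probability of the spam class (label 1).
        p = pipe.predict_proba([m.text])[0, 1]
        # 0.5 is a placeholder; replace it with the threshold tuned to business cost.
        return {"spam_probability": float(p), "is_spam": bool(p > 0.5)}

Lessons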
- Baseline first, Transformer later.
- Ignore class imbalance and the metrics mislead: accuracy can look high while most spam slips through.
- Tune the threshold to the product.
- Use the same preprocessing for training and inference: pickle the whole pipeline.
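The last point can be sketched as follows: persist the fitted `Pipeline` as one object so the vectorizer and the classifier always travel together. The six training messages are made up, and the temporary directory is only there to keep the example self-contained; the file name `spam_pipe.pkl` matches the one api.py loads.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set: 1 = spam, 0 = ham.
train_texts = [
    "free prize claim now",
    "win cash urgent reply",
    "claim your free cash prize",
    "are you coming home",
    "see you at lunch",
    "call me when you are free",
]
train_labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_texts, train_labels)

# Save vectorizer + classifier as a single artifact, then load it back.
path = os.path.join(tempfile.mkdtemp(), "spam_pipe.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

msg = ["claim your free prize"]
p_before = pipe.predict_proba(msg)[0, 1]
p_after = restored.predict_proba(msg)[0, 1]
print(p_before == p_after)
```

If a custom function such as `clean` is passed to `TfidfVectorizer(preprocessor=...)`, the pickle stores only a reference to it, so the module that defines the function must also be importable wherever the model is loaded.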
Summary
At a glance
Spam detection is the cleanest introduction to text classification. Start with TF-IDF + LR and move to DistilBERT only if needed. Threshold tuning and explainability both matter in production.