Hook: Making Sense of Language
Human language is ambiguous and context-dependent. NLP makes that language intelligible to computers; search, translation, chatbots, and summarization are all applications of it.
Classical NLP Pipeline
- Tokenization: text → tokens.
- Lowercasing, stopword removal.
- Stemming / Lemmatization: reduce each word to a base form.
- POS tagging, NER.
- Vectorization: Bag-of-Words, TF-IDF.
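The first and last steps of this pipeline can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the corpus `DOCS`, the `STOPWORDS` set, and the helper names are hypothetical, and the TF-IDF weight is the plain textbook form tf × log(N / df). Libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are pets.",
]
STOPWORDS = {"the", "on", "and", "are", "a", "an", "is"}


def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(tokens):
    """Map each token to its raw count in one document."""
    return Counter(tokens)


def tf_idf(docs_tokens):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights


tokenized = [preprocess(d) for d in DOCS]
bows = [bag_of_words(t) for t in tokenized]
tfidf = tf_idf(tokenized)

print(tokenized[0])  # tokens left after lowercasing and stopword removal
print(bows[1])       # raw counts for the second document
print(tfidf[0])      # "cat" appears in two documents, so it gets a lower weight than "sat"
```

Because the sketch has no stemming or lemmatization step, "cat" and "cats" stay separate tokens; reducing words to a base form is what would merge them.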
Word Embeddings
- Word2Vec: CBOW, Skip-gram.
- GloVe: global co-occurrence statistics.
- FastText: subword (character n-gram) vectors.
- Contextual (ELMo, BERT): the same word gets a different vector depending on its context.
king − man + woman ≈ queen
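The analogy above is ordinary vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-picked, hypothetical 3-dimensional vectors (not output from Word2Vec or GloVe) purely to show the mechanics; the `analogy` helper is also an illustrative name.

```python
import math

# Hypothetical toy vectors, hand-picked for illustration.
# Rough axes: (royalty, maleness, femaleness). Real embeddings are learned
# from data and typically have hundreds of dimensions.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


result = analogy("king", "man", "woman", VECTORS)
print(result)
```

With these toy vectors the nearest remaining word is "queen". On real embeddings the result is only approximate, which is why the relation is written with ≈.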
Modern NLP — Transformer Era
- BERT: bidirectional encoder; classification/QA.
- GPT: autoregressive decoder; text generation.
- T5: text-to-text; every task is framed as text in, text out.
- Sentence-BERT: sentence embeddings for semantic similarity.
Common Tasks
- Text Classification (sentiment, spam).
- NER: entity extraction.
- Question Answering.
- Summarization.
- Translation.
- Topic Modeling (LDA).
Code: A Few Tasks with HuggingFace
nlp_demo.py
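from transformers import pipeline

# Each pipeline downloads a default pretrained (English) model on first use.
clf = pipeline("sentiment-analysis")
print(clf("Lovable is a great tool!"))

# aggregation_strategy="simple" merges word pieces into whole entities;
# it replaces the deprecated grouped_entities=True.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Sundar Pichai is the CEO of Google in California."))

summ = pipeline("summarization")
print(summ("Long article text...", max_length=60, min_length=20))
TF-IDF + Logistic Regression (classical)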
tfidf.py
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
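# X_train/X_test (lists of raw text strings) and y_train/y_test (their labels)
# are assumed to be defined elsewhere.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # unigrams + bigrams seen in >= 2 documents
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # mean accuracy on the test set
Common Mistakes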
- Mismatched tokenizer and model.
- Using an English-only model on Bangla/multilingual text.
- Removing stopwords and destroying the meaning in the process (not needed with BERT).
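The stopword pitfall is easy to reproduce. The sketch below assumes a small hypothetical stopword list that contains the negation "not"; some stock lists do the same. After filtering, a positive and a negative sentence collapse to identical tokens.

```python
import re

# Hypothetical stopword list for illustration; note that it includes "not".
STOPWORDS = {"this", "is", "the", "a", "not", "was"}


def strip_stopwords(text):
    """Lowercase, tokenize, and drop every token found in STOPWORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


positive = strip_stopwords("This movie is good")
negative = strip_stopwords("This movie is not good")

print(positive)
print(negative)
print(positive == negative)  # the negation is gone, so the two inputs are indistinguishable
```

A sentiment classifier trained on the filtered tokens cannot tell these two reviews apart, which is one reason to pass the raw text to BERT-style models.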
Summary
At a glance
NLP = Token → Embedding → Model (classical or Transformer) → Task. Today the Transformer is the default.