Transformers: Attention Is All You Need (Machine Learning)

Hook — Attention is All You Need

In 2017, Google's paper 'Attention Is All You Need' changed the direction of AI. It dropped the RNN and built a sequence model from attention alone: fast, parallel, and scalable. GPT, BERT, and ChatGPT are all Transformers.

Why Transformer?

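Full parallelism: all positions in a sequence are processed at once during training, not step by step as in an RNN.
Long-range dependency: every position attends directly to any other position, however far apart.
Performance keeps improving as the model is scaled up (scaling laws).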

Architecture Overview

Encoder: turns the input into a representation (BERT).
Decoder: generates output tokens one at a time (GPT).
Encoder-Decoder: translation, summarization (T5, BART).
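In practice, the difference between an encoder and a decoder shows up in which positions a token may attend to. Below is a minimal NumPy sketch, assuming the common convention that an encoder uses a full (bidirectional) mask and a decoder uses a causal lower-triangular mask. The function names `encoder_mask` and `decoder_mask` are illustrative and do not come from any library.

```python
import numpy as np

def encoder_mask(seq_len: int) -> np.ndarray:
    """Bidirectional: every position may attend to every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

def decoder_mask(seq_len: int) -> np.ndarray:
    """Causal: position i may attend only to positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Row i shows which positions token i is allowed to see (1 = allowed).
print(encoder_mask(4).astype(int))
print(decoder_mask(4).astype(int))
```

The causal mask is what lets a decoder generate one token at a time: no position can see tokens that come after it.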

Building Blocks

Token Embedding: word (token) → vector.
Positional Encoding: information about token order (sinusoidal or learned).
Multi-Head Self-Attention: the heart of the Transformer.
Feed-Forward Network: a position-wise MLP.
Layer Normalization + Residual Connection.
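To make the first two building blocks concrete, here is a small sketch of the sinusoidal variant of positional encoding added to token embeddings. It assumes the formulation from the original paper (sine on even dimensions, cosine on odd dimensions, base 10000) and an even `d_model`. The toy sizes and the random "embeddings" are placeholders, not real model weights.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Hypothetical toy sizes: 5 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 8))
x = token_embeddings + sinusoidal_positions(5, 8)   # input to the first layer
print(x.shape)  # (5, 8)
```

A learned positional encoding would replace `sinusoidal_positions` with a trainable lookup table of the same shape.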

Self-Attention: In Brief

Each token 'looks at' every other token and learns how much weight to give each one.

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V

Multi-Head means several attention operations run in parallel, each capturing a different kind of relationship.
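The formula above can be sketched in a few lines of NumPy. This is an illustrative, unoptimized version. The weight matrices `Wq`, `Wk`, `Wv`, `Wo` are random placeholders standing in for learned parameters, and the sketch assumes `d_model` is divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for arrays shaped (..., seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Each head runs the same attention formula on its own slice.
    out, _ = attention(split(x @ Wq), split(x @ Wk), split(x @ Wv))
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # merge heads
    return concat @ Wo

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, d_model = 8
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
y = multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads=2)
print(y.shape)  # (4, 8)
```

Dividing by √dₖ keeps the dot products from growing with the head dimension, which would otherwise push the softmax toward near one-hot weights.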

Code: BERT with Hugging Face

bert_sentiment.py

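from transformers import pipeline

# DistilBERT (a distilled BERT) fine-tuned on SST-2: an English-only sentiment classifier
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

# The model was trained on English text, so pass an English sentence
print(clf("Building an app with Lovable is really great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]  (exact score may differ)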

GPT-style Text Generation

gpt_generate.py

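from transformers import pipeline, set_seed

set_seed(42)  # generation may sample randomly; fix the seed for repeatable output
gen = pipeline("text-generation", model="gpt2")
# max_length counts the prompt tokens plus the generated tokens
print(gen("Machine learning is", max_length=30, num_return_sequences=1))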

Transformer Family

BERT: encoder, bidirectional, classification/QA.
GPT: decoder, autoregressive, generation.
T5 / BART: encoder-decoder, seq2seq.
ViT: a Transformer over image patches.
LLaMA, Mistral, Gemma: open-weight LLMs.

Common Mistakes

Forgetting that attention memory grows quadratically with sequence length.
Tokenizer mismatch: a model must be used with its own matching tokenizer.
Batching without passing a padding mask.
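Two of these mistakes are easy to see with a little arithmetic and a toy batch. The sketch below assumes the full attention score matrix is materialized in float32 and uses 12 heads as an example value. The helper names and the token ids are hypothetical, chosen only for illustration.

```python
import numpy as np

def attention_scores_bytes(seq_len, num_heads=12, batch_size=1, bytes_per_value=4):
    """Memory for one layer's attention score matrices (float32 by default)."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

# Quadrupling the sequence length multiplies this memory by 16.
for n in (512, 2048, 8192):
    print(n, attention_scores_bytes(n) / 2**20, "MiB")

def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists; mask is 1 for real tokens, 0 for padding."""
    max_len = max(len(s) for s in sequences)
    ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, s in enumerate(sequences):
        ids[i, :len(s)] = s
        mask[i, :len(s)] = 1
    return ids, mask

# Toy token ids, not taken from any real vocabulary.
ids, mask = pad_batch([[5, 8, 2], [7, 2]])
print(mask)
# [[1 1 1]
#  [1 1 0]]
```

Without the mask, the padded position in the second sequence would take part in attention like a real token. With Hugging Face tokenizers, calling the tokenizer with `padding=True` returns an `attention_mask` alongside `input_ids`, and it should be passed to the model together with them.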

Summary

At a Glance

Transformer = Self-Attention + Positional Encoding + Parallel processing. It is the foundation of most of today's large AI models.