Hook: Give the LLM knowledge
An LLM does not know everything, and it sometimes states things that are wrong. RAG (Retrieval-Augmented Generation) addresses this: it retrieves the relevant passages from your documents and asks the LLM to answer from them.
End-to-End Pipeline
rag-flow
Ingest: Docs → Chunk → Embed → Vector DB
Query: Question → Embed → Top-K retrieve → Rerank
Generate: (Context + Question) → LLM → Answer + Citations
Chunking Strategy
- Fixed token size (e.g. 512) with overlap (50–100 tokens).
- Recursive splitter: split on paragraphs first, then on sentences.
- Semantic chunking: break where the embedding similarity between adjacent passages drops.
- Markdown/PDF-aware: keep each chunk together with its heading.
- Keep metadata: source, page, section.
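The fixed-size strategy with overlap can be sketched in a few lines. This is a minimal illustration, assuming whitespace-separated words stand in for tokens (a real pipeline would count tokens with the embedding model's tokenizer). `chunk_words` and its parameters are hypothetical names, not part of any library.

```python
def chunk_words(text: str, size: int = 512, overlap: int = 80) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demonstration with 10 words, windows of 4, overlap of 1
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, size=4, overlap=1)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.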
Ingestion Code
ingest.py
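import glob

import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pypdf import PdfReader  # assumption: the original does not show how PDFs are read
from sentence_transformers import SentenceTransformer

emb = SentenceTransformer("BAAI/bge-small-en-v1.5")
# chunk_size is measured in characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=80)
db = chromadb.PersistentClient("./chroma").get_or_create_collection("docs")

def load_pdf(path: str) -> str:
    # Assumed helper: concatenate the extracted text of every page
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

pdf_paths = sorted(glob.glob("./pdfs/*.pdf"))  # hypothetical input folder

for path in pdf_paths:
    text = load_pdf(path)
    chunks = splitter.split_text(text)
    if not chunks:
        continue  # no extractable text, so nothing to embed or store
    db.add(
        ids=[f"{path}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=emb.encode(chunks).tolist(),
        metadatas=[{"source": path, "chunk": i} for i in range(len(chunks))],
    )
Retrieve + Rerank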
retrieve.py
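import chromadb
from sentence_transformers import CrossEncoder, SentenceTransformer

# Same embedding model and collection that ingest.py wrote to
emb = SentenceTransformer("BAAI/bge-small-en-v1.5")
db = chromadb.PersistentClient("./chroma").get_or_create_collection("docs")
reranker = CrossEncoder("BAAI/bge-reranker-base")

def search(q: str, k: int = 5) -> list[str]:
    # Stage 1: fetch 20 candidates by vector similarity
    cand = db.query(query_embeddings=emb.encode([q]).tolist(), n_results=20)
    docs = cand["documents"][0]
    if not docs:
        return []
    # Stage 2: rescore each (question, chunk) pair and keep the best k
    scores = reranker.predict([(q, d) for d in docs])
    top = sorted(zip(docs, scores), key=lambda x: -x[1])[:k]
    return [d for d, _ in top]
Generation with Citations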
answer.py
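from openai import OpenAI  # assumption: `client` is the OpenAI Python SDK client
from retrieve import search  # search() as defined in retrieve.py

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know".
Cite sources as [1], [2] matching the snippets.
Context:
{ctx}
Question: {q}
Answer:"""

def answer(q: str):
    chunks = search(q)
    # Number the snippets so the model can cite them as [1], [2], ...
    ctx = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(ctx=ctx, q=q)}],
    ).choices[0].message.content
Advanced Tricks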
- HyDE: embed a hypothetical answer and search with that embedding.
- Multi-query: generate 3 rewrites of the same question and take the union of the retrieved results.
- Hybrid search: BM25 + vector.
- Parent-document retrieval: match on small chunks, feed the larger parent chunk to the LLM.
- Self-RAG: the model itself decides whether retrieval is needed.
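Hybrid search and multi-query both end with several ranked lists that have to be merged. One common way to do that is reciprocal rank fusion (RRF). The text above does not say which fusion method it intends, so treat this as one option. `rrf_merge` is a hypothetical helper, and the constant 60 is the value conventionally used with RRF, not something tuned for this pipeline.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse ranked lists of chunk ids: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Higher fused score first; ties broken by id so the output is deterministic
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


bm25_hits = ["c3", "c1", "c7"]    # hypothetical keyword-search result ids
vector_hits = ["c1", "c4", "c3"]  # hypothetical vector-search result ids
merged = rrf_merge([bm25_hits, vector_hits], top_n=3)
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and vector distances live on different scales. The same function works for multi-query: pass one ranked list per rewritten question.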
Evaluation — RAGAS
- Faithfulness: how well the answer is grounded in the retrieved context.
- Answer relevance: how well the answer matches the question.
- Context precision/recall: the quality of the retrieved chunks.
- Tools: `ragas`, `trulens`, or a custom LLM-as-judge.
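The context metrics are easiest to reason about with hand-labelled relevance. The sketch below is a simplified, set-based version, assuming you already know which chunk ids are relevant to a question. It is not how `ragas` computes its scores, and `context_precision_recall` is a hypothetical name.

```python
def context_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of retrieved chunk ids against a labelled relevant set."""
    unique = set(retrieved)
    hits = len(unique & relevant)
    precision = hits / len(unique) if unique else 0.0  # share of retrieved chunks that are relevant
    recall = hits / len(relevant) if relevant else 0.0  # share of relevant chunks that were retrieved
    return precision, recall


# Hypothetical example: 4 chunks retrieved, 2 of the 3 relevant ones among them
p, r = context_precision_recall(["c1", "c2", "c3", "c4"], {"c1", "c3", "c9"})
```

Low precision suggests the reranker is letting noise through; low recall suggests the first-stage retrieval (or the chunking) is missing material.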
Deployment Tips
- Vector DB: Chroma (dev), pgvector / Qdrant / Pinecone (prod).
- Embeddings cache: repeated queries cost nothing to embed.
- Streaming answers: token-by-token UX.
- Citation UI: clickable source preview.
- Re-ingest on document update, with versioning.
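For a citation UI, the `[1]`, `[2]` markers in the answer have to be mapped back to the chunks they refer to. A minimal sketch, assuming the answer uses the bracketed numbering from the prompt and that you kept each chunk's metadata in the same order as the snippets. `resolve_citations` and the sample metadata are hypothetical.

```python
import re


def resolve_citations(answer: str, sources: list[dict]) -> list[dict]:
    """Map [n] markers in the answer to sources[n-1], in order of first appearance."""
    cited = []
    seen = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        # Skip repeats and numbers outside the snippet range (the model made them up)
        if n in seen or not 1 <= n <= len(sources):
            continue
        seen.add(n)
        cited.append({"marker": n, **sources[n - 1]})
    return cited


sources = [
    {"source": "handbook.pdf", "chunk": 4},  # hypothetical metadata for snippet [1]
    {"source": "faq.pdf", "chunk": 0},       # hypothetical metadata for snippet [2]
]
links = resolve_citations("Refunds take 5 days [2]. See also [1] and [7].", sources)
```

Dropping out-of-range markers such as `[7]` keeps the UI from rendering a link to a source that was never in the context.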
Summary
At a glance
RAG = Chunk + Embed + Retrieve + Rerank + Generate. Hallucinations decrease and sources become citable. Without evaluation, quality drifts silently.