Hook: from 10 users to 10 million
A demo with 10 users is easy. At a million users, latency climbs, GPUs saturate, and costs go through the roof. Scalable ML system design is the engineering that addresses this challenge, so the system stays smooth as load grows.
Scaling Principles
- Prefer horizontal scaling (more replicas) over vertical scaling (a bigger machine).
- Stateless services: any instance should be able to handle any request.
- Async + queue: absorb traffic bursts.
- Cache: avoid duplicate computation.
- Batch when possible: raise GPU utilization.
- Decouple: keep the training, serving, and feature pipelines separate.
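The caching principle above can be sketched in a few lines. This is a minimal illustration that assumes inference is a pure function of its input, so a stored result is safe to reuse. `expensive_predict` is a hypothetical stand-in for a real model call, and `functools.lru_cache` is an in-process cache; a stateless multi-replica service would more likely share a cache such as Redis across instances.

```python
from functools import lru_cache

CALLS = {"model": 0}  # counts how often the (simulated) model actually runs


def expensive_predict(text: str) -> int:
    """Hypothetical stand-in for a costly model call."""
    CALLS["model"] += 1
    return sum(ord(ch) for ch in text) % 100  # deterministic fake score


@lru_cache(maxsize=1024)
def cached_predict(text: str) -> int:
    """Return the stored score when the same input was seen before."""
    return expensive_predict(text)


if __name__ == "__main__":
    for prompt in ["hello", "world", "hello", "hello"]:
        cached_predict(prompt)
    print(CALLS["model"])               # model invocations for four requests
    print(cached_predict.cache_info())  # hit/miss counters from lru_cache
```

Only distinct inputs reach the model; repeated prompts are answered from the cache.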
Reference Architecture
high-level
Client → CDN/Edge → API Gateway (auth, rate limit)
                          ↓
                    Load Balancer
                          ↓
┌────────── Inference Service (auto-scale) ──────────┐
│ Feature Cache (Redis) ← Online Feature Store       │
│ Model Server (vLLM/Triton/TorchServe)              │
│ Fallback model (small/cached)                      │
└────────────────────────────────────────────────────┘
                          ↓
              Logging / Metrics / Trace
                          ↓
Offline Pipeline → Retrain → Model Registry → Canary Deploy
Distributed Training
- Data Parallel: the same model on every worker, a different batch on each (DDP).
- Model Parallel: split the model itself across devices (large LLMs).
- Pipeline Parallel: partition the layers into sequential stages.
- ZeRO / FSDP: shard optimizer state, gradients, and parameters across workers.
- Tools: PyTorch DDP/FSDP, DeepSpeed, Megatron, Ray Train.
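To make the data-parallel idea concrete, the sketch below simulates it in plain Python with no GPU or framework. It assumes a one-parameter linear model with mean squared error and equally sized shards; `grad_mse` and `data_parallel_step` are illustrative names, not part of any library. Averaging the per-shard gradients plays the role that an all-reduce plays in DDP.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w: float, data: List[Sample], workers: int, lr: float) -> float:
    """One SGD step: shard the batch, compute local gradients, average, update."""
    shard = len(data) // workers  # assumes len(data) is divisible by workers
    local_grads = [
        grad_mse(w, data[i * shard:(i + 1) * shard]) for i in range(workers)
    ]
    avg_grad = sum(local_grads) / workers  # what a mean all-reduce would produce
    return w - lr * avg_grad


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
    w_single = 0.0 - 0.01 * grad_mse(0.0, data)  # one worker, full batch
    w_parallel = data_parallel_step(0.0, data, workers=2, lr=0.01)
    print(w_single, w_parallel)  # the two updates should agree
```

With equal shard sizes, the averaged gradient equals the full-batch gradient, which is why every replica can apply the same update and stay in sync.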
Serving Patterns
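- Dynamic Batching: group incoming requests so the GPU is used efficiently.
- Continuous Batching (vLLM): reported LLM throughput gains of roughly 10-24×, depending on the baseline and workload.
- Multi-Model Serving: several models on the same server.
- Model Caching: keep hot models resident on the GPU.
- Replica auto-scaling: QPS- or latency-based HPA (Kubernetes Horizontal Pod Autoscaler).
- Async inference with a webhook callback.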
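Dynamic batching from the list above can be sketched with `asyncio`. This is a simplified illustration, not how Triton or vLLM implement it: the `DynamicBatcher` class, its `max_batch_size` and `max_wait_s` parameters, and the `run_model` stub (which just upper-cases strings) are all assumptions made for the example.

```python
import asyncio
from typing import List, Tuple


class DynamicBatcher:
    """Group concurrent requests into one batched model call."""

    def __init__(self, max_batch_size: int = 8, max_wait_s: float = 0.01):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for inspection only

    def run_model(self, prompts: List[str]) -> List[str]:
        """Hypothetical batched model call; a real one would run on the GPU."""
        return [p.upper() for p in prompts]

    async def infer(self, prompt: str) -> str:
        """Called once per request: enqueue the prompt and await its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def worker(self) -> None:
        """Collect up to max_batch_size items, or until max_wait_s elapses."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.run_model([p for p, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def demo() -> Tuple[List[str], List[int]]:
    batcher = DynamicBatcher(max_batch_size=4, max_wait_s=0.05)
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.infer(f"req-{i}") for i in range(6)))
    task.cancel()
    return list(results), batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs trade latency for throughput: a larger `max_wait_s` fills batches more fully but makes the first request in each batch wait longer.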
Code: Ray Serve (Auto-scaling)
ray_serve.py
from ray import serve
from starlette.requests import Request


@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 20,
        # Named target_num_ongoing_requests_per_replica in older Ray releases.
        "target_ongoing_requests": 5,
    },
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
)
class LLMService:
    def __init__(self):
        # Imported here so the model loads inside the replica process.
        from transformers import pipeline

        self.gen = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.2-3B-Instruct",
            device=0,
        )

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        # Note: this call blocks the replica's event loop while generating.
        out = self.gen(body["prompt"], max_new_tokens=200)
        return {"text": out[0]["generated_text"]}


serve.run(LLMService.bind(), route_prefix="/generate")
Data Pipeline Scaling
- Partitioning + parallel reads (Spark, Ray Data).
- Streaming feature computation (Flink, Spark Structured Streaming).
- Online + offline feature stores.
- CDC (Change Data Capture): Debezium.
- Backpressure handling.
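Backpressure, the last item above, can be illustrated with a bounded queue. In this sketch the producer is forced to pause whenever the consumer falls behind, so the buffer never grows without limit. The function names, the `max_buffer` parameter, and the simulated 1 ms of work are assumptions for the example; production systems such as Flink or Kafka consumers apply the same idea with their own mechanisms.

```python
import asyncio
from typing import List, Tuple


async def producer(queue: asyncio.Queue, n_items: int) -> None:
    """Emit items; `put` suspends whenever the bounded queue is full."""
    for i in range(n_items):
        await queue.put(i)
    await queue.put(None)  # sentinel: no more data


async def consumer(queue: asyncio.Queue, depths: List[int]) -> List[int]:
    """Slow consumer that records the queue depth it observes."""
    processed = []
    while True:
        item = await queue.get()
        if item is None:
            return processed
        depths.append(queue.qsize())
        await asyncio.sleep(0.001)  # simulated feature computation
        processed.append(item * 2)


async def run_pipeline(n_items: int = 50, max_buffer: int = 5) -> Tuple[List[int], int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffer)
    depths: List[int] = []
    _, processed = await asyncio.gather(
        producer(queue, n_items), consumer(queue, depths)
    )
    return processed, max(depths)


if __name__ == "__main__":
    processed, peak_depth = asyncio.run(run_pipeline())
    print(len(processed), peak_depth)  # peak depth stays within max_buffer
```

Without `maxsize`, a fast producer would simply pile items up in memory; the bound turns that hidden growth into an explicit slowdown at the source.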
Reliability Patterns
- Circuit Breaker: fail fast when a downstream dependency is failing.
- Retry with exponential backoff + jitter.
- Timeouts everywhere.
- Bulkhead: isolate resource pools.
- Graceful degradation: fall back to a smaller model when the full model fails.
- Multi-region failover.
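Three of the patterns above (retry with backoff and jitter, circuit breaker, graceful degradation) fit in one short sketch. It is a simplified illustration rather than a production library: `retry_with_backoff`, `CircuitBreaker`, and every parameter name and default value are assumptions chosen for the example, and the injectable `sleep` and `clock` arguments exist only to make the behavior easy to test.

```python
import random
import time
from typing import Callable, Optional


def retry_with_backoff(fn: Callable[[], str], max_attempts: int = 4,
                       base_delay_s: float = 0.01, max_delay_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry fn, sleeping a random time up to an exponentially growing cap."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries
    raise ValueError("max_attempts must be >= 1")


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial after `cooldown_s`."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()  # fail fast: skip the broken dependency
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()  # graceful degradation to the smaller model
        self.failures = 0
        return result
```

A typical use would wrap the full-model call as `primary` and a small or cached model as `fallback`, so users still get an answer while the main model server recovers.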
Mini Case Study: News Feed Ranking
Two-stage retrieval and ranking, followed by a re-ranking pass:
- Candidate Generation: 1 million posts → 1,000 (cheap embedding-based ANN search).
- Ranking: 1,000 → 50 (heavy DNN, GPU).
- Re-ranking: diversity and business rules (CPU).
- Features: user embeddings (Redis), post embeddings (Faiss/ScaNN).
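The funnel above can be sketched end to end on toy data. Everything here is a stand-in: brute-force dot products replace the ANN index, a weighted sum with made-up weights (0.7 and 0.3) replaces the heavy DNN, and a per-author cap is one hypothetical example of a diversity rule. The function names are illustrative only.

```python
import heapq
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user: Vector, posts: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: cheap embedding similarity (brute force here, ANN in production)."""
    return heapq.nlargest(k, posts, key=lambda pid: dot(user, posts[pid]))


def rank(user: Vector, posts: Dict[str, Vector], candidates: List[str],
         freshness: Dict[str, float], k: int) -> List[str]:
    """Stage 2: a costlier score on the shortlist (stand-in for a heavy DNN)."""
    def score(pid: str) -> float:
        return 0.7 * dot(user, posts[pid]) + 0.3 * freshness[pid]
    return sorted(candidates, key=score, reverse=True)[:k]


def rerank_diverse(ranked: List[str], author: Dict[str, str],
                   max_per_author: int) -> List[str]:
    """Stage 3: business rule capping how many posts one author contributes."""
    seen: Dict[str, int] = {}
    feed = []
    for pid in ranked:
        name = author[pid]
        if seen.get(name, 0) < max_per_author:
            seen[name] = seen.get(name, 0) + 1
            feed.append(pid)
    return feed


if __name__ == "__main__":
    user = (1.0, 0.0)
    posts = {"p1": (0.9, 0.1), "p2": (0.8, 0.2), "p3": (0.1, 0.9), "p4": (0.7, 0.3)}
    freshness = {"p1": 0.1, "p2": 0.9, "p3": 0.5, "p4": 0.2}
    author = {"p1": "alice", "p2": "alice", "p3": "bob", "p4": "carol"}
    shortlist = generate_candidates(user, posts, k=3)
    ranked = rank(user, posts, shortlist, freshness, k=3)
    print(rerank_diverse(ranked, author, max_per_author=1))
```

The expensive scorer only ever sees the shortlist, which is what keeps GPU cost roughly constant as the post catalog grows.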
Summary
At a glance
Scalable ML = Stateless + Cache + Batch + Async + Auto-scale + Multi-stage retrieval-ranking + Reliability patterns.