
Scalable ML Systems


System design for a million users.

Hook: from 10 users to 10 million

Serving 10 users in a demo is easy. At a million users, latency climbs, GPUs saturate, and costs soar. Scalable ML system design is the engineering that answers this challenge, so the system stays smooth as load grows.

Scaling Principles

  • Prefer horizontal scaling (more replicas) over vertical scaling (a bigger machine).
  • Stateless services: any instance can handle any request.
  • Async + queue: absorb traffic bursts.
  • Cache: avoid duplicate computation.
  • Batch when possible: raise GPU utilization.
  • Decouple: keep the training, serving, and feature pipelines separate.
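
To make the "async + queue" principle concrete, here is a minimal sketch using only `asyncio`. The names `handle`, `worker`, and `run_burst` are hypothetical, and `handle` is a stand-in for real inference work. A bounded queue lets a fixed pool of workers drain a burst at their own pace, and a full queue makes producers wait instead of overloading the workers.

```python
import asyncio


async def handle(request_id: int) -> str:
    # Hypothetical stand-in for real inference work.
    await asyncio.sleep(0.01)
    return f"done-{request_id}"


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker pulls requests off the queue until it is cancelled.
    while True:
        request_id = await queue.get()
        try:
            results.append(await handle(request_id))
        finally:
            queue.task_done()


async def run_burst(n_requests: int, n_workers: int = 4, max_queue: int = 100) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
    results: list = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    for i in range(n_requests):
        # put() waits when the queue is full, which slows the producer down.
        await queue.put(i)
    await queue.join()  # wait until every queued request has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling `asyncio.run(run_burst(50))` pushes 50 requests through 4 workers. In a real service the queue would usually be an external broker rather than an in-process object.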

Reference Architecture

high-level
Client → CDN/Edge → API Gateway (auth, rate limit)
                      ↓
               Load Balancer
                      ↓
        ┌────────── Inference Service (auto-scale) ──────────┐
        │  Feature Cache (Redis) ← Online Feature Store     │
        │  Model Server (vLLM/Triton/TorchServe)            │
        │  Fallback model (small/cached)                     │
        └────────────────────────────────────────────────────┘
                      ↓
         Logging / Metrics / Trace
                      ↓
      Offline Pipeline → Retrain → Model Registry → Canary Deploy
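
The request path inside the inference service can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: `FeatureCache` and `predict` are hypothetical names, an in-memory dict stands in for Redis, and the feature store and models are plain callables.

```python
from typing import Callable, Dict, List


class FeatureCache:
    """Cache-aside lookup; an in-memory dict stands in for Redis."""

    def __init__(self, store_lookup: Callable[[str], List[float]]):
        self._cache: Dict[str, List[float]] = {}
        self._store_lookup = store_lookup  # hypothetical online feature store call
        self.hits = 0
        self.misses = 0

    def get(self, user_id: str) -> List[float]:
        if user_id in self._cache:
            self.hits += 1
            return self._cache[user_id]
        # Cache miss: read from the feature store, then populate the cache.
        self.misses += 1
        features = self._store_lookup(user_id)
        self._cache[user_id] = features
        return features


def predict(user_id: str, cache: FeatureCache,
            primary: Callable[[List[float]], float],
            fallback: Callable[[List[float]], float]) -> dict:
    features = cache.get(user_id)
    try:
        return {"score": primary(features), "served_by": "primary"}
    except Exception:
        # If the main model server fails, answer with the small fallback model.
        return {"score": fallback(features), "served_by": "fallback"}
```

A production cache would also need expiry and invalidation, which this sketch leaves out.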

Distributed Training

  • Data Parallel: the same model on every worker, a different batch on each (DDP).
  • Model Parallel: split the model itself across devices (large LLMs).
  • Pipeline Parallel: partition the layers into stages.
  • ZeRO / FSDP: shard optimizer state, gradients, and parameters across workers.
  • Tools: PyTorch DDP/FSDP, DeepSpeed, Megatron, Ray Train.
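
To see why sharding matters, the following sketch estimates per-GPU memory for model states. It assumes the mixed-precision Adam accounting used in the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter) and ignores activations, buffers, and fragmentation. The function name `zero_memory_bytes` is hypothetical.

```python
def zero_memory_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for model states under ZeRO stage 0-3."""
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 shards the optimizer state
    if stage >= 2:
        grads /= num_gpus     # stage 2 also shards the gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also shards the parameters
    return params + grads + optim


for stage in range(4):
    gb = zero_memory_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this gives about 120.0, 31.4, 16.6, and 1.9 GB per GPU for stages 0 through 3 under these assumptions.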

Serving Patterns

  • Dynamic Batching: group incoming requests so the GPU is used efficiently.
  • Continuous Batching (vLLM): reported LLM throughput gains of roughly 10-24×, depending on workload and baseline.
  • Multi-Model Serving: multiple models on the same server.
  • Model Caching: keep hot models resident on the GPU.
  • Replica auto-scaling: HPA driven by QPS or latency.
  • Async inference with a webhook callback.
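
Dynamic batching can be sketched in a few lines of `asyncio`. This is a simplified illustration, not how Triton or vLLM implement it: `DynamicBatcher`, `max_batch_size`, and `max_wait_s` are hypothetical names, and `batch_fn` stands in for a model call that takes a list of inputs and returns a list of outputs. The loop collects requests until the batch is full or a short wait expires, then runs them in one call.

```python
import asyncio
from typing import Any, Callable, List


class DynamicBatcher:
    def __init__(self, batch_fn: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 8, max_wait_s: float = 0.01):
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for inspection only

    async def submit(self, item: Any) -> Any:
        # Each caller gets a future that resolves when its batch has run.
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # wait for the first request
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(batch))
            try:
                outputs = self._batch_fn([item for item, _ in batch])
                for (_, fut), out in zip(batch, outputs):
                    fut.set_result(out)
            except Exception as exc:
                # Propagate a batch failure to every waiting caller.
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)


async def demo() -> tuple:
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    runner = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    runner.cancel()
    await asyncio.gather(runner, return_exceptions=True)
    return results, batcher.batch_sizes
```

The trade-off sits in `max_wait_s`: a longer wait fills batches better and raises throughput, at the cost of added latency for the first request in each batch.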

Code — Ray Serve (Auto-scaling)

ray_serve.py
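from ray import serve
from starlette.requests import Request


@serve.deployment(
    num_replicas="auto",  # let Serve autoscale within the bounds below
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 20,
        # Recent Ray releases call this "target_ongoing_requests"; older ones
        # use "target_num_ongoing_requests_per_replica".
        "target_ongoing_requests": 5,
    },
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
)
class LLMService:
    def __init__(self):
        # Imported here so the model loads inside each replica process.
        from transformers import pipeline
        # This checkpoint is gated on Hugging Face and needs an access token.
        self.gen = pipeline("text-generation",
                            model="meta-llama/Llama-3.2-3B-Instruct",
                            device=0)

    async def __call__(self, request: Request):
        body = await request.json()
        # The pipeline call is synchronous, so it blocks this replica's event
        # loop while generating; concurrent requests on the replica wait.
        out = self.gen(body["prompt"], max_new_tokens=200)
        return {"text": out[0]["generated_text"]}


app = LLMService.bind()
# serve.run returns once the app is deployed. Keep the process alive, or start
# the app with the `serve run` CLI, so the replicas keep serving.
serve.run(app, route_prefix="/generate")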
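
A client for this endpoint might look like the sketch below, using only the standard library. It assumes Serve is listening on its default local address `http://localhost:8000`; the helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    # The deployment reads a JSON body with a "prompt" field.
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout_s: float = 60.0) -> str:
    # Always set a timeout; generation can take seconds under load.
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["text"]
```

With the service running, `generate("Explain dynamic batching in one sentence.")` would return the generated text.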

Data Pipeline Scaling

  • Partitioning + parallel reads (Spark, Ray Data).
  • Streaming feature computation (Flink, Spark Structured Streaming).
  • Online + offline feature stores.
  • CDC (Change Data Capture) with Debezium.
  • Backpressure handling.
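
Partitioning is the piece that makes parallel reads possible. The sketch below shows one common approach, stable hash partitioning by key, in plain Python. The function names are hypothetical, and engines such as Spark or Ray Data handle this internally with their own partitioners.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    # A cryptographic digest is stable across processes and runs, unlike the
    # built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_records(records: List[dict], key_field: str,
                      num_partitions: int) -> Dict[int, List[dict]]:
    parts: Dict[int, List[dict]] = defaultdict(list)
    for rec in records:
        parts[partition_for(str(rec[key_field]), num_partitions)].append(rec)
    return dict(parts)
```

Because every record with the same key lands in the same partition, each worker can compute per-key features without coordinating with the others.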

Reliability Patterns

  • Circuit Breaker: fail fast when a downstream dependency is failing.
  • Retry with exponential backoff + jitter.
  • Timeouts everywhere.
  • Bulkhead: isolate resource pools.
  • Graceful degradation: fall back to a smaller model when the full model fails.
  • Multi-region failover.
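
Two of these patterns are sketched below under simplifying assumptions: a single-threaded caller, and hypothetical names (`backoff_delays`, `CircuitBreaker`, `CircuitOpenError`). The backoff helper uses the "full jitter" variant, where each delay is drawn uniformly between 0 and the capped exponential value.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delays(max_retries: int, base_s: float = 0.1, cap_s: float = 5.0,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Full-jitter delays: rng() * min(cap_s, base_s * 2**attempt)."""
    return [rng() * min(cap_s, base_s * 2 ** attempt) for attempt in range(max_retries)]


class CircuitOpenError(RuntimeError):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: reject immediately instead of waiting on a dead service.
                raise CircuitOpenError("circuit open: failing fast")
            # Otherwise half-open: let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically sleep for each value from `backoff_delays` between retries, and wrap the downstream call in `breaker.call(...)` so that repeated failures stop reaching the dependency until the reset window has passed.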

Mini Case Study — News Feed Ranking

Two-stage retrieval and ranking, followed by a re-ranking pass:

  • Candidate Generation: 1 million posts → 1,000 (cheap embedding-based ANN search).
  • Ranking: 1,000 → 50 (heavy DNN on GPU).
  • Re-ranking: diversity and business rules (CPU).
  • Features: user embeddings (Redis), post embeddings (Faiss/ScaNN).
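
The funnel above can be sketched end to end in plain Python. Everything here is a stand-in chosen for illustration: a brute-force dot product replaces the ANN index, `score_fn` replaces the heavy DNN, and a per-author cap is one hypothetical example of a diversity rule.

```python
from typing import Callable, Dict, List, Sequence

Post = Dict[str, object]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec: Sequence[float], posts: List[Post], k: int) -> List[Post]:
    # Stage 1: cheap similarity search (brute force here, ANN in production).
    return sorted(posts, key=lambda p: dot(user_vec, p["embedding"]), reverse=True)[:k]


def rank(user_vec: Sequence[float], candidates: List[Post], k: int,
         score_fn: Callable[[Sequence[float], Post], float]) -> List[Post]:
    # Stage 2: the expensive model only sees the short candidate list.
    return sorted(candidates, key=lambda p: score_fn(user_vec, p), reverse=True)[:k]


def rerank_for_diversity(ranked: List[Post], max_per_author: int) -> List[Post]:
    # Stage 3: keep rank order but cap how many posts one author can have.
    counts: Dict[object, int] = {}
    out: List[Post] = []
    for post in ranked:
        author = post["author"]
        if counts.get(author, 0) < max_per_author:
            counts[author] = counts.get(author, 0) + 1
            out.append(post)
    return out


def build_feed(user_vec, posts, n_candidates, n_ranked, score_fn, max_per_author):
    candidates = generate_candidates(user_vec, posts, n_candidates)
    ranked = rank(user_vec, candidates, n_ranked, score_fn)
    return rerank_for_diversity(ranked, max_per_author)
```

The design point is cost: the heavy scorer runs on `n_candidates` items per request instead of the full corpus, which is what keeps GPU spend bounded as the post count grows.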

Summary

At a glance

Scalable ML = Stateless + Cache + Batch + Async + Auto-scale + Multi-stage retrieval-ranking + Reliability patterns.