Scalable ML Systems: Machine Learning

Hook: From 10 Users to 10 Million

Serving 10 users in a demo is easy. At a million users, latency climbs, GPUs get overloaded, and costs skyrocket. Scalable ML system design is the engineering that addresses this challenge, so the system stays smooth as load grows.

Scaling Principles

Prefer horizontal scaling (more replicas) over vertical scaling (a bigger machine).
Stateless services: any instance should be able to handle any request.
Async + queue: absorb traffic bursts.
Cache: avoid duplicate computation.
Batch when possible: raise GPU utilization.
Decouple: keep the training, serving, and feature pipelines separate.
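
As a small illustration of the cache principle, the sketch below memoizes a stand-in for an expensive model call so that repeated inputs are answered from memory. The function `expensive_predict` and its scoring logic are hypothetical placeholders, not a real model.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the "model" actually runs


@lru_cache(maxsize=10_000)
def expensive_predict(text: str) -> int:
    """Hypothetical stand-in for a costly model call."""
    CALLS["count"] += 1
    return sum(ord(ch) for ch in text) % 100  # dummy score


incoming = ["hello", "world", "hello", "hello", "world"]
scores = [expensive_predict(text) for text in incoming]
info = expensive_predict.cache_info()  # hits/misses recorded by lru_cache
```

An in-process cache like this is local to one replica. With many stateless replicas, a shared cache such as the Redis feature cache in the reference architecture would typically play this role.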

Reference Architecture

high-level

Client → CDN/Edge → API Gateway (auth, rate limit)
                      ↓
               Load Balancer
                      ↓
        ┌────────── Inference Service (auto-scale) ──────────┐
        │  Feature Cache (Redis) ← Online Feature Store     │
        │  Model Server (vLLM/Triton/TorchServe)            │
        │  Fallback model (small/cached)                     │
        └────────────────────────────────────────────────────┘
                      ↓
         Logging / Metrics / Trace
                      ↓
      Offline Pipeline → Retrain → Model Registry → Canary Deploy

Distributed Training

Data Parallel: the same model on every worker, a different batch on each (DDP).
Model Parallel: split the model itself across devices (large LLMs).
Pipeline Parallel: divide the layers into stages.
ZeRO / FSDP: shard optimizer state, gradients, and parameters.
Tools: PyTorch DDP/FSDP, DeepSpeed, Megatron, Ray Train.
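
To make the data-parallel idea concrete, here is a framework-free sketch. Each simulated worker computes a gradient on its own shard, and an averaging step (standing in for all-reduce) produces the gradient every worker applies. The one-parameter model, the toy data, and the helper names are assumptions for illustration; real DDP does this across processes and GPUs.

```python
import math


def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for all-reduce: every worker ends up with the mean."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # toy targets
w = 0.5                     # current weight, identical on all workers
num_workers = 4
shard = len(xs) // num_workers

# Each worker sees only its own equal-sized slice of the batch.
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]
synced_grad = all_reduce_mean(local_grads)
full_grad = grad_mse(w, xs, ys)  # what a single big-batch worker would compute
```

With equal shard sizes, the averaged gradient matches the full-batch gradient, which is why data parallelism can scale batch size without changing the update rule.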

Serving Patterns

Dynamic Batching: group incoming requests so the GPU is used efficiently.
Continuous Batching (vLLM): reported LLM throughput gains of roughly 10-24×, depending on workload and baseline.
Multi-Model Serving: several models on the same server.
Model Caching: keep hot models resident on the GPU.
Replica auto-scaling: HPA driven by QPS or latency.
Async inference with a webhook callback.
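
Dynamic batching can be sketched with plain asyncio. The `DynamicBatcher` class below is a hypothetical, simplified helper, not the API of Triton or vLLM: it waits a short fixed window after the first request, then sends whatever has queued up (capped at `max_batch_size`) to the model in one call.

```python
import asyncio


class DynamicBatcher:
    """Groups concurrent requests into a single model call (simplified)."""

    def __init__(self, model_fn, max_batch_size=8, max_wait_s=0.01):
        self.model_fn = model_fn            # takes a list, returns a list
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Called per request; resolves once the item's batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        while True:
            batch = [await self.queue.get()]      # wait for the first request
            await asyncio.sleep(self.max_wait_s)  # short window to collect more
            while len(batch) < self.max_batch_size and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            # One call for the whole batch (synchronous in this sketch).
            outputs = self.model_fn([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def demo():
    batch_sizes = []

    def fake_model(inputs):  # hypothetical model: doubles each input
        batch_sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_wait_s=0.01)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
```

The trade-off is latency for throughput: a longer wait window yields fuller batches but delays the first request in each batch. Production servers usually also dispatch early once a batch is full.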

Code: Ray Serve (Auto-scaling)

ray_serve.py

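from ray import serve
from starlette.requests import Request

# Assumes a recent Ray release that supports num_replicas="auto".
@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 20,
        # Add replicas when each one averages more than 5 in-flight requests.
        # Older Ray releases named this target_num_ongoing_requests_per_replica.
        "target_ongoing_requests": 5,
    },
    ray_actor_options={"num_gpus": 1},  # reserve one GPU per replica
)
class LLMService:
    def __init__(self):
        # Imported lazily; the model itself is loaded once per replica.
        from transformers import pipeline
        self.gen = pipeline("text-generation",
                            model="meta-llama/Llama-3.2-3B-Instruct",
                            device=0)

    async def __call__(self, request: Request):
        body = await request.json()
        # Synchronous generation: the replica handles one prompt at a time.
        out = self.gen(body["prompt"], max_new_tokens=200)
        return {"text": out[0]["generated_text"]}

# Deploy the service under the /generate HTTP route.
serve.run(LLMService.bind(), route_prefix="/generate")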

Data Pipeline Scaling

Partitioning + parallel read (Spark, Ray Data).
Streaming feature compute (Flink, Spark Structured Streaming).
Online + offline feature store.
CDC (Change Data Capture), e.g. Debezium.
Backpressure handling.
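
One common way to get backpressure is a bounded queue between pipeline stages: when the consumer falls behind, the producer is suspended instead of piling up unbounded work in memory. The sketch below uses asyncio; the stage names, queue size, and simulated delay are illustrative assumptions.

```python
import asyncio


async def producer(queue, n_items):
    for i in range(n_items):
        await queue.put(i)   # suspends here whenever the queue is full
    await queue.put(None)    # sentinel: no more items


async def consumer(queue, processed, max_depth):
    while True:
        max_depth[0] = max(max_depth[0], queue.qsize())  # track queue depth
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0.001)  # simulate slow feature computation
        processed.append(item)


async def demo():
    queue = asyncio.Queue(maxsize=5)  # the bound is what creates backpressure
    processed, max_depth = [], [0]
    await asyncio.gather(
        producer(queue, 50),
        consumer(queue, processed, max_depth),
    )
    return processed, max_depth[0]


processed, max_depth = asyncio.run(demo())
```

Streaming engines such as Flink apply the same idea between operators, so a slow sink throttles the upstream source rather than exhausting memory.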

Reliability Patterns

Circuit Breaker: fail fast when a downstream dependency is failing.
Retry with exponential backoff + jitter.
Timeouts everywhere.
Bulkhead: isolate resource pools.
Graceful degradation: if the full model fails, fall back to a smaller one.
Multi-region failover.
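
Three of these patterns (retry with backoff and jitter, circuit breaker, graceful degradation) can be sketched in a few lines. `retry_with_backoff` and `CircuitBreaker` are hypothetical helpers written for this example, with arbitrary thresholds and delays; they are not taken from any particular library.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.01, max_delay=1.0,
                       sleep=time.sleep):
    """Retry fn, sleeping a random time up to an exponentially growing cap."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # "full jitter" spreads out retries


class CircuitBreaker:
    """Opens after repeated failures, then serves the fallback without calling fn."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: fail fast, degrade gracefully
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # success closes the circuit
        return result


# Retry demo: a call that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

retry_result = retry_with_backoff(flaky, sleep=lambda s: None)  # skip real sleeps

# Breaker demo: a primary model that is down, with a small fallback.
primary_calls = {"n": 0}

def broken_model():
    primary_calls["n"] += 1
    raise RuntimeError("model server down")

breaker = CircuitBreaker(failure_threshold=2, reset_after_s=60.0)
answers = [breaker.call(broken_model, lambda: "fallback") for _ in range(5)]
```

After the second failure the breaker opens, so the remaining three requests are served by the fallback without touching the failing model server.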

Mini Case Study: News Feed Ranking

Multi-stage architecture:

Candidate Generation: 1 million posts → 1000 (cheap embedding ANN).
Ranking: 1000 → 50 (heavy DNN, GPU).
Re-ranking: diversity, business rules (CPU).
Features: user embedding (Redis), post embedding (Faiss/ScaNN).
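
The funnel can be sketched end to end in pure Python at a much smaller scale. Everything here is an illustrative assumption: random embeddings, a brute-force dot product in place of an ANN index such as Faiss or ScaNN, a made-up `heavy_score` in place of the ranking DNN, and a simple per-topic cap as the diversity rule.

```python
import random

random.seed(0)
DIM = 8


def rand_vec():
    return [random.uniform(-1, 1) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy corpus: 2000 posts with random embeddings and one of 5 topics.
posts = {pid: {"emb": rand_vec(), "topic": pid % 5} for pid in range(2000)}
user_emb = rand_vec()


def generate_candidates(user, k):
    """Stage 1: cheap similarity search (brute force here, ANN in production)."""
    by_similarity = sorted(posts, key=lambda pid: dot(user, posts[pid]["emb"]),
                           reverse=True)
    return by_similarity[:k]


def rank(user, candidates, k):
    """Stage 2: placeholder for the heavy ranking model."""
    def heavy_score(pid):  # hypothetical score, not a real DNN
        return dot(user, posts[pid]["emb"]) + 0.1 * (pid % 7)
    return sorted(candidates, key=heavy_score, reverse=True)[:k]


def rerank(ranked, max_per_topic):
    """Stage 3: diversity rule that caps posts per topic, preserving order."""
    seen, out = {}, []
    for pid in ranked:
        topic = posts[pid]["topic"]
        if seen.get(topic, 0) < max_per_topic:
            seen[topic] = seen.get(topic, 0) + 1
            out.append(pid)
    return out


candidates = generate_candidates(user_emb, k=200)  # 2000 -> 200
ranked = rank(user_emb, candidates, k=20)          # 200 -> 20
feed = rerank(ranked, max_per_topic=5)             # apply diversity cap
```

The point of the funnel is cost control: the expensive scorer only ever sees the small candidate set, so its cost is independent of corpus size.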

Summary

At a glance

Scalable ML = Stateless + Cache + Batch + Async + Auto-scale + Multi-stage retrieval-ranking + Reliability patterns.