Real-time Inference: Machine Learning

Hook: Within 100 ms

Search auto-complete, fraud detection, ad bidding, voice assistants: all of them need responses within milliseconds. Real-time inference = the right answer + low latency, both at the same time.

Latency Budget

100ms budget breakdown

Client → Edge (TLS + network):    20 ms
API Gateway + auth:                5 ms
Feature lookup (Redis):           10 ms
Model inference:                  40 ms
Post-process + business rule:     10 ms
Network back to client:           15 ms
───────────────────────────────────────
Total:                           100 ms  ✅

Your budget

First fix the total budget, then split it across the individual steps. Know in advance which steps have no slack left.
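
The budgeting step can be sketched in a few lines of Python. The step names and numbers are copied from the breakdown above; the `find_overruns` helper and the measured values are hypothetical and only illustrate how a per-step check might look.

```python
# Per-step latency budget in milliseconds, copied from the breakdown above.
BUDGET_MS = {
    "client_to_edge": 20,
    "gateway_auth": 5,
    "feature_lookup": 10,
    "model_inference": 40,
    "post_process": 10,
    "network_back": 15,
}
TOTAL_BUDGET_MS = 100


def find_overruns(measured_ms: dict) -> dict:
    """Return {step: ms over budget} for every step that exceeded its share."""
    return {
        step: measured_ms[step] - limit
        for step, limit in BUDGET_MS.items()
        if measured_ms.get(step, 0) > limit
    }


# The per-step shares must add up to the total budget.
assert sum(BUDGET_MS.values()) == TOTAL_BUDGET_MS

# Hypothetical measurements: feature lookup is slow, everything else is on budget.
measured = dict(BUDGET_MS, feature_lookup=18)
overruns = find_overruns(measured)
print(overruns)  # {'feature_lookup': 8}
```

A check like this makes it obvious which step has to give time back when the total goes over 100 ms.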

Latency Reduction Techniques

Model Optimization

Quantization: FP32 → INT8/INT4 (often 2-4x faster on hardware with low-precision kernels).
Pruning: remove unneeded weights.
Knowledge Distillation: a bigger teacher model trains a smaller student.
ONNX Runtime, TensorRT, OpenVINO: graph optimization.
Compilation: torch.compile, XLA.
Flash Attention, Paged KV cache (LLM).
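
To make the quantization item concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It shows only the numeric idea (store small integers plus one scale factor); it is not how any particular runtime implements it, and the weight values are made up. The actual speedup comes from integer kernels in the runtime, which this sketch does not model.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.003, 0.45, -0.6]  # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer step bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half a quantization step.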

System Level

GPU warm pool: avoid cold starts.
gRPC / HTTP/2: connection reuse.
Co-locate the model and the feature store.
Pre-compute embeddings offline, run ANN search online.
Speculative decoding (LLM).
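
The "pre-compute offline, search online" split can be sketched as follows. This is an assumption-laden toy: the item names and 3-dimensional vectors are invented, and the online step is an exact linear scan standing in for a real ANN index, which would replace the scoring loop once the catalogue is large.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Offline step: embed every item once and store the normalized vectors.
# These vectors are made up; a real system would load model outputs.
ITEM_EMBEDDINGS = {
    "item_a": normalize([1.0, 0.0, 0.0]),
    "item_b": normalize([0.0, 1.0, 0.0]),
    "item_c": normalize([0.7, 0.7, 0.0]),
}


def top_k(query, k=2):
    """Online step: score the query against stored vectors, return the best k.

    Exact linear scan; an ANN index would replace this loop at scale.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), item)
        for item, vec in ITEM_EMBEDDINGS.items()
    ]
    scored.sort(reverse=True)
    return [item for _, item in scored[:k]]


print(top_k([0.9, 0.1, 0.0]))  # ['item_a', 'item_c']
```

Only the query is embedded at request time; the item side of the work has already been paid for offline.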

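Code: ONNX Runtime

to_onnx.py

import torch
# `model` (a trained torch.nn.Module) and `dummy_input` (one example
# input tensor) are assumed to be defined elsewhere.
model.eval()  # export in inference mode
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["logits"],
                  dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
                  opset_version=17)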

serve_onnx.py

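import onnxruntime as ort
import numpy as np

# Providers are tried in order: CUDA first, CPU as the fallback.
sess = ort.InferenceSession("model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

def predict(x: np.ndarray):
    # Cast to float32, the dtype the exported model is assumed to expect.
    return sess.run(["logits"], {"input": x.astype(np.float32)})[0]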

Dynamic Batching

A GPU is inefficient at batch=1. Waiting about 10 ms to group incoming requests into a batch raises latency slightly, but throughput can rise 5-10x.

Triton Inference Server: built-in dynamic batching.
vLLM: continuous batching (LLM specific).
TorchServe: enabled through config.
Trade-off: tune max batch size against max wait time.
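
The batch-size versus wait-time trade-off can be sketched with a small asyncio micro-batcher. This is an illustration of the idea only, not how Triton, vLLM, or TorchServe implement it; `MicroBatcher`, `fake_model`, and the parameter values are hypothetical.

```python
import asyncio


class MicroBatcher:
    """Collect requests until the batch is full or the wait time expires."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_ms=10):
        self.batch_fn = batch_fn  # runs the model on a list of inputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000
        self.queue = asyncio.Queue()

    async def predict(self, x):
        """Called once per request; resolves when its batch has been run."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((x, future))
        return await future

    async def run(self):
        """Background loop: build one batch, run it, hand results back."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
            outputs = self.batch_fn([x for x, _ in batch])
            for (_, future), out in zip(batch, outputs):
                future.set_result(out)


batch_sizes = []  # recorded only to show how requests were grouped


def fake_model(inputs):
    """Hypothetical stand-in for one batched model call."""
    batch_sizes.append(len(inputs))
    return [x * 2 for x in inputs]


async def demo():
    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_ms=10)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(6)))
    worker.cancel()
    return results


results = asyncio.run(demo())
print(results, batch_sizes)
```

With six concurrent requests and a batch limit of 4, the model is typically called twice: once with a full batch, and once with the remaining requests after the wait expires. A production version would also need error handling so that a failed batch does not leave callers waiting forever.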

Streaming Response

Send tokens one by one instead of waiting for the LLM to finish the full output; perceived latency drops considerably.

stream.py

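from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
# `llm` is assumed to be a client object defined elsewhere whose
# stream(prompt) method is an async iterator of tokens.

@app.post("/chat")
async def chat(body: dict):
    async def gen():
        # Emit each token as one Server-Sent Events message.
        async for token in llm.stream(body["prompt"]):
            yield f"data: {token}\n\n"
    return StreamingResponse(gen(), media_type="text/event-stream")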

Edge Inference

Browser: ONNX Runtime Web (the successor to ONNX.js), WebGPU, Transformers.js.
Mobile: Core ML (iOS), TFLite/LiteRT (Android).
CDN edge: Cloudflare Workers AI, Vercel Edge.
Why: privacy, no network round trip, offline use.
Cost: the model has to be small and quantized.

Measuring Right

p50 / p95 / p99 latency: the average is misleading.
Track cold start latency separately.
Measure end-to-end (client-perceived) latency, not just the model.
Load test: k6, Locust, Vegeta.
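
A small worked example shows why the average misleads. The latencies below are made up (90 fast requests and 10 slow ones), and `percentile` is a simple nearest-rank helper written for this sketch, not a library function.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in ms: 90 fast requests and 10 slow ones.
latencies_ms = [20] * 90 + [900] * 10

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(mean_ms, p50, p95, p99)  # 108 20 900 900
```

The mean of 108 ms describes no request that actually happened: the typical request took 20 ms, while one in ten took 900 ms, and only the p95/p99 figures expose that tail.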

Summary

At a glance

Real-time = Budget → Optimize model (quant/distill/ONNX) → Optimize system (batch/cache/stream) → Measure p95/p99. Moving inference to the edge makes it faster still.