Hook: From Notebook to Production
You hit 95% accuracy in Jupyter. Congratulations! But who is going to use it? Deployment means serving the model, typically as an API, so that apps, mobile clients, and other services can send it requests.
Deployment Options
- REST API: FastAPI, Flask.
- gRPC: low-latency, microservices.
- Batch: scheduled jobs, large data.
- Streaming: Kafka, real-time.
- Edge: mobile/browser (ONNX, TFLite, WebGPU).
- Serverless: Lambda, Cloud Functions.
Model Serialization
- Pickle / joblib: scikit-learn.
- TorchScript / state_dict: PyTorch.
- SavedModel / H5: TensorFlow.
- ONNX: framework-agnostic, fast runtime.
- Safetensors: safe (no arbitrary code execution on load) and fast to load; common for LLM weights.
save.py
import joblib

# `model` is an already-fitted scikit-learn estimator
joblib.dump(model, "model.pkl")

# Load it back, for example in the serving process
model = joblib.load("model.pkl")
Code: FastAPI Serving
app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="Iris Classifier")
model = joblib.load("model.pkl")  # loaded once at startup, not per request
CLASSES = ["setosa", "versicolor", "virginica"]


class Features(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/predict")
def predict(f: Features):
    # One row, with columns in the same order used at training time
    x = np.array([[f.sepal_length, f.sepal_width, f.petal_length, f.petal_width]])
    proba = model.predict_proba(x)[0]
    idx = int(np.argmax(proba))
    return {"class": CLASSES[idx], "confidence": float(proba[idx])}
run
uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4
# OpenAPI docs: http://localhost:8000/docs
Production Essentials
- Input validation: Pydantic (Python) or Zod (TypeScript).
- Versioning: /v1/predict, model version header.
- Logging: request id, latency, prediction.
- Async / batch endpoint: increases throughput.
- Health and readiness probes.
- Rate limiting and auth (API key/JWT).
- Auto-scaling: based on load.
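Several items from the list above (validation, auth via an API key, request-id and latency logging, a model version in the response) can be sketched without any web framework. This is a minimal illustration, not a production implementation: the `API_KEY` environment variable, the `x-api-key` and `x-request-id` header names, `MODEL_VERSION`, and the `handle_predict` helper are all assumptions made for this example. Rate limiting and auto-scaling are left out.

```python
import hmac
import logging
import os
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("serving")

MODEL_VERSION = "1.0.0"  # hypothetical version label
FIELDS = ("sepal_length", "sepal_width", "petal_length", "petal_width")


def validate(payload: dict) -> list:
    """Return the features in a fixed order, or raise ValueError on bad input."""
    missing = [name for name in FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    try:
        return [float(payload[name]) for name in FIELDS]
    except (TypeError, ValueError):
        raise ValueError("all features must be numeric") from None


def authorized(headers: dict) -> bool:
    """Compare the client's key with the one from the environment (never hardcoded)."""
    expected = os.environ.get("API_KEY", "")
    supplied = headers.get("x-api-key", "")
    # compare_digest avoids leaking information through timing differences.
    return bool(expected) and hmac.compare_digest(supplied.encode(), expected.encode())


def handle_predict(payload: dict, headers: dict, predict_fn) -> tuple:
    """Return (status_code, body) for one prediction request."""
    request_id = headers.get("x-request-id") or str(uuid.uuid4())
    start = time.perf_counter()
    if not authorized(headers):
        status, body = 401, {"error": "invalid API key"}
    else:
        try:
            features = validate(payload)
            status, body = 200, {"prediction": predict_fn(features)}
        except ValueError as exc:
            status, body = 422, {"error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    log.info("request_id=%s status=%d latency_ms=%.2f body=%s",
             request_id, status, latency_ms, body)
    body["model_version"] = MODEL_VERSION
    body["request_id"] = request_id
    return status, body
```

In a FastAPI app, Pydantic models and dependencies would normally take over the validation and auth parts; the point here is only to show what each checklist item does for a single request.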
LLM-Specific Serving
- vLLM: PagedAttention; the project reports up to ~24x higher throughput than Hugging Face Transformers.
- TGI (Text Generation Inference).
- Ollama: quick local serving.
- TensorRT-LLM: NVIDIA-optimized.
- Streaming responses (SSE).
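To make the SSE item concrete, here is a small sketch of how generated tokens can be framed as Server-Sent Events: each event is a `data:` line followed by a blank line. The `fake_token_stream` generator stands in for a real model, and the `[DONE]` end marker is a common convention rather than part of the SSE format; both are assumptions for this example.

```python
import json
from typing import Iterable, Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Stand-in for a model that yields tokens one at a time (hypothetical)."""
    for word in text.split():
        yield word + " "


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one SSE event: a 'data:' line plus a blank line."""
    for token in tokens:
        # JSON-encode so newlines inside a token cannot break the framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # End-of-stream sentinel; "[DONE]" is a convention, not part of SSE itself.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for event in sse_events(fake_token_stream("hello from the model")):
        print(event, end="")
```

In FastAPI, a generator like `sse_events(...)` can be returned through `StreamingResponse` with `media_type="text/event-stream"`, so the client sees tokens as they are produced instead of waiting for the full completion.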
Common Mistakes
- Not replicating the training-time preprocessing in the API.
- Single-threaded inference (GIL bottleneck).
- Loading the model on every request (instead of once at startup).
- Hardcoding secrets: use environment variables instead.
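Two of the mistakes above can be avoided with very little code: cache the loaded artifact so it is read from disk only once, and store the preprocessing parameters in the same artifact as the model so the API cannot drift from training. The sketch below uses a toy dict-based "model" and the standard-library `pickle`; the bundle keys (`mean`, `scale`, `threshold`) and the `MODEL_PATH` environment variable are assumptions for illustration.

```python
import functools
import os
import pickle

# Path comes from the environment instead of being hardcoded (hypothetical variable name).
MODEL_PATH = os.environ.get("MODEL_PATH", "model.pkl")


@functools.lru_cache(maxsize=1)
def get_model(path: str) -> dict:
    """Load the artifact once; later calls with the same path reuse the cached object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict(raw_value: float, path: str = MODEL_PATH) -> int:
    bundle = get_model(path)
    # Preprocessing parameters travel with the model, so serving matches training.
    z = (raw_value - bundle["mean"]) / bundle["scale"]
    return int(z > bundle["threshold"])
```

With scikit-learn, the usual way to get the same effect is to fit a `Pipeline` that contains both the preprocessing steps and the estimator, and to serialize the pipeline as a single object.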
Summary
At a glance
Deployment = Serialize → API (FastAPI) → validate, log, scale → versioned endpoint. For LLMs, use vLLM/TGI.