Hook: Deploy is the beginning, not the end
The world changes, users change, and data changes, but the model stays fixed. Without monitoring, you will never know when the model has gone stale. Monitoring is the most underrated part of production ML.
What to monitor?
- Operational: latency (p50/p95/p99), throughput, error rate, uptime.
- Data quality: missing values, range violations, schema breaks.
- Data drift: the input distribution changes.
- Concept drift: the input-output relationship changes.
- Model performance: accuracy, AUC (once labels arrive).
- Business KPIs: conversion, revenue.
- Fairness: subgroup performance.
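The data-quality checks in the list above (missing values, range violations, schema breaks) can be sketched as a small per-row validator. This is a minimal illustration, not a full validation framework: the `SCHEMA` dictionary, its field names, and its ranges are hypothetical and would come from your own feature definitions.

```python
import math

# Hypothetical schema (an assumption for this sketch): field -> (type, min, max).
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}

def check_row(row):
    """Return a list of data-quality problems found in one input row."""
    problems = []
    for field, (ftype, lo, hi) in SCHEMA.items():
        value = row.get(field)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"{field}: missing")
        elif not isinstance(value, ftype):
            problems.append(f"{field}: schema break (got {type(value).__name__})")
        elif not lo <= value <= hi:
            problems.append(f"{field}: range violation ({value})")
    # Fields that the schema does not know about also count as a schema break.
    for field in row.keys() - SCHEMA.keys():
        problems.append(f"{field}: unexpected field")
    return problems

good = check_row({"age": 35, "income": 52000.0})
bad = check_row({"age": 150, "income": None, "zip": "1207"})
print(good)
print(bad)
```

In production the count of failing rows per time window would typically be exported as a metric, so a sudden jump is visible on a dashboard.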
Data vs Concept Drift
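The difference
Data drift means P(X) has changed (for example, a shift in user demographics). Concept drift means P(y|X) has changed (for example, spending patterns changed after COVID).
- Tests for detecting drift: KS test, PSI (Population Stability Index), Wasserstein distance.
- Rule of thumb for PSI: below 0.1 is stable, 0.1 to 0.25 is moderate drift, above 0.25 warrants an alert.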
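To make the distinction concrete, here is a toy sketch of concept drift: the inputs are identical before and after, so P(X) is unchanged, but the labeling rule moves, so P(y|X) changes and a frozen model loses accuracy. The threshold values (50 and 70) are made up for illustration.

```python
def model(x):
    """Frozen model learned before the shift: predicts 1 when x >= 50."""
    return 1 if x >= 50 else 0

# Same inputs before and after the shift, so the input distribution is unchanged.
inputs = list(range(0, 100, 5))

labels_before = [1 if x >= 50 else 0 for x in inputs]
labels_after = [1 if x >= 70 else 0 for x in inputs]  # the true rule has moved

def accuracy(xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

acc_before = accuracy(inputs, labels_before)
acc_after = accuracy(inputs, labels_after)
print(acc_before, acc_after)  # 1.0 0.8
```

An input-only drift test would see nothing here, which is why model performance also has to be tracked once labels arrive.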
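The PSI mentioned above can be computed by hand as the sum over bins of (current fraction - reference fraction) * ln(current fraction / reference fraction). The sketch below uses only the standard library. Equal-width bins over the reference range and the `eps` floor for empty bins are assumptions of this sketch (quantile bins are also common), and both choices affect the exact value.

```python
import math

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp out-of-range values into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = fractions(reference)
    cur_frac = fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))

reference = [i % 100 for i in range(1000)]       # uniform over 0..99
same = [(i * 7) % 100 for i in range(1000)]      # same distribution, different order
shifted = [min(v + 30, 99) for v in reference]   # mass pushed toward high values

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
print(psi_same, psi_shifted)
```

With these synthetic samples, `psi_same` falls in the stable band and `psi_shifted` lands well above the 0.25 alert threshold.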
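Code: Drift Report with Evidently
drift.py
    # Written against the older Evidently Report API; newer releases changed these import paths.
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset, RegressionPreset

    # ref_df, prod_df (pandas DataFrames) and alert_slack() are defined elsewhere.
    # RegressionPreset expects target and prediction columns in both frames.
    report = Report(metrics=[DataDriftPreset(), RegressionPreset()])
    report.run(reference_data=ref_df, current_data=prod_df)
    report.save_html("drift_report.html")

    # The first metric from DataDriftPreset is the dataset-level drift summary.
    result = report.as_dict()
    if result["metrics"][0]["result"]["dataset_drift"]:
        alert_slack("⚠️ Data drift detected")
Prometheus Metrics from FastAPI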
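metrics.py
    from fastapi import FastAPI
    from prometheus_client import Counter, Histogram
    from prometheus_fastapi_instrumentator import Instrumentator

    app = FastAPI()
    # Default HTTP metrics, exposed at /metrics for Prometheus to scrape.
    Instrumentator().instrument(app).expose(app, endpoint="/metrics")

    # Custom metrics. The label is named predicted_class because "class" is a Python keyword.
    PRED = Counter("predictions_total", "Total predictions", ["predicted_class"])
    LAT = Histogram("predict_latency_seconds", "Latency")

    # Features (the request schema) and model are defined elsewhere.
    @app.post("/predict")
    def predict(f: Features):
        with LAT.time():  # records how long the prediction takes
            out = model.predict(...)
        PRED.labels(predicted_class=str(out)).inc()
        return out
Monitoring Stack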
- Metrics: Prometheus + Grafana.
- Logs: Loki, ELK, Datadog.
- Tracing: OpenTelemetry, Jaeger.
- ML-specific: Evidently, WhyLabs, Arize, Fiddler.
- Alerting: PagerDuty, Slack webhook.
Feedback Loop
- Prediction log → ground truth join → live metric.
- Capture user feedback (thumbs up/down).
- Drift detection → auto-retrain trigger.
- Compare candidate models with a shadow deployment.
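The first step of the loop above (prediction log → ground truth join → live metric) might look like the sketch below. The `request_id` key, the log layout, and the sample records are hypothetical; in practice the two sides usually come from a prediction store and a delayed label source.

```python
# Hypothetical logs keyed by request id (an assumption for this sketch).
prediction_log = [
    {"request_id": "a1", "prediction": 1},
    {"request_id": "a2", "prediction": 0},
    {"request_id": "a3", "prediction": 1},
    {"request_id": "a4", "prediction": 0},
]
ground_truth = {"a1": 1, "a2": 1, "a3": 1}  # the label for a4 has not arrived yet

def live_accuracy(predictions, labels):
    """Join predictions to delayed labels and return (accuracy, coverage)."""
    joined = [
        (p["prediction"], labels[p["request_id"]])
        for p in predictions
        if p["request_id"] in labels
    ]
    if not joined:
        return None, 0.0
    accuracy = sum(pred == truth for pred, truth in joined) / len(joined)
    coverage = len(joined) / len(predictions)  # share of predictions with a label
    return accuracy, coverage

accuracy, coverage = live_accuracy(prediction_log, ground_truth)
print(accuracy, coverage)
```

Reporting coverage next to accuracy helps, because a live metric computed on a small labeled fraction can be misleading.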
Common Mistakes
- Tracking only accuracy: latency and business KPIs matter too.
- Not updating the reference dataset.
- Alert fatigue: thresholds that are too sensitive.
- Not checking subgroup performance: bias goes unnoticed.
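The last mistake in the list is easy to illustrate: an acceptable overall accuracy can hide a subgroup that performs much worse. The sketch below computes accuracy per group. The group names and records are made up for illustration only.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred). Returns {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}

# Made-up records: (group, true label, predicted label).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 0),
]

overall = sum(t == p for _, t, p in records) / len(records)
per_group = accuracy_by_group(records)
print(overall)    # 0.75
print(per_group)  # {'group_a': 1.0, 'group_b': 0.5}
```

Here the overall figure of 0.75 masks the fact that every positive case in `group_b` is missed.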
Summary
At a glance
Monitoring = Operational + Data quality + Drift + Performance + Business. Prometheus/Grafana + Evidently = base stack.