
Model Monitoring

Drift, latency, alerting.

Hook: Deployment is the beginning, not the end

The world changes, users change, and data changes, but the model stays fixed. Without monitoring, you will never know when the model has gone stale. Monitoring is the most underrated part of production ML.

What Should We Monitor?

  • Operational: latency (p50/p95/p99), throughput, error rate, uptime.
  • Data quality: missing values, range violations, schema breaks.
  • Data drift: the input distribution changes.
  • Concept drift: the input-output relationship changes.
  • Model performance: accuracy, AUC (once labels arrive).
  • Business KPIs: conversion, revenue.
  • Fairness: performance per subgroup.
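
The data quality item above (missing values, range violations, schema breaks) can be sketched as a per-record check. This is a minimal illustration, not a full validation framework: the SCHEMA dictionary, its column names, and its bounds are hypothetical and would come from your own training data.

```python
import math

# Hypothetical schema: column name -> (expected type, min allowed, max allowed)
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}

def check_row(row, schema=SCHEMA):
    """Return a list of data-quality issues found in one input record."""
    issues = []
    for col, (typ, lo, hi) in schema.items():
        if col not in row:
            issues.append(f"schema: missing column {col}")
            continue
        val = row[col]
        if val is None or (isinstance(val, float) and math.isnan(val)):
            issues.append(f"missing: {col}")
        elif not isinstance(val, typ):
            issues.append(f"schema: {col} has type {type(val).__name__}")
        elif not lo <= val <= hi:
            issues.append(f"range: {col}={val} outside [{lo}, {hi}]")
    # Columns the model was never trained on also count as a schema break
    for col in sorted(row.keys() - schema.keys()):
        issues.append(f"schema: unexpected column {col}")
    return issues
```

In practice the count of issues per category would be exported as a metric, so that a sudden jump in missing values or schema breaks shows up on a dashboard before it shows up in model performance.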

Data vs Concept Drift

The difference

Data drift means P(X) has changed (for example, a shift in user demographics). Concept drift means P(y|X) has changed (for example, spending patterns changed after COVID).

  • Tests: KS test, PSI (Population Stability Index), Wasserstein distance.
  • Common PSI rule of thumb: < 0.1 = stable, 0.1 to 0.25 = moderate shift, > 0.25 = alert.
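
To make the PSI thresholds above concrete, here is a minimal standard-library sketch for a single numeric feature. It assumes bins are taken from the reference sample's quantiles and that empty bins are floored at a small eps; the function names psi and psi_status, the bin count, and eps are illustrative choices, not part of any library.

```python
import math
from bisect import bisect_right

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    ref_sorted = sorted(reference)
    # Interior bin edges at reference quantiles, so each bin holds
    # roughly the same share of the reference data.
    edges = [ref_sorted[int(len(ref_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_frac = fractions(reference)
    cur_frac = fractions(current)
    # PSI = sum over bins of (current - reference) * ln(current / reference)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))

def psi_status(value):
    """Map a PSI value onto the rule-of-thumb bands."""
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "moderate"
    return "alert"
```

A sample compared against itself gives a PSI of zero, while a strongly shifted sample lands well above 0.25. The result depends on the binning, so the same bin edges should be reused whenever scores are compared over time.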

Code: Drift Report with Evidently

drift.py
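# Uses Evidently's legacy Report API; newer releases may use different import paths.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, RegressionPreset

# ref_df (reference data), prod_df (recent production data) and
# alert_slack() are assumed to be defined elsewhere.
# By default RegressionPreset expects "target" and "prediction" columns in both frames.
report = Report(metrics=[DataDriftPreset(), RegressionPreset()])
report.run(reference_data=ref_df, current_data=prod_df)
report.save_html("drift_report.html")

# DataDriftPreset is listed first, so metrics[0] holds the dataset-level drift summary.
result = report.as_dict()
if result["metrics"][0]["result"]["dataset_drift"]:
    alert_slack("⚠️ Data drift detected")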

Prometheus Metrics from FastAPI

metrics.py
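from fastapi import FastAPI
from prometheus_client import Counter, Histogram
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
# Default HTTP metrics (request counts, latency), exposed at /metrics for scraping.
Instrumentator().instrument(app).expose(app, endpoint="/metrics")

# Custom ML metrics: predictions per class and model inference latency.
PRED = Counter("predictions_total", "Total predictions", ["class"])
LAT = Histogram("predict_latency_seconds", "Latency")

# Features (the request schema) and model are assumed to be defined elsewhere.
@app.post("/predict")
def predict(f: Features):
    with LAT.time():
        out = model.predict(...)
        # "class" is a Python keyword, so the label value is passed positionally.
        PRED.labels(out).inc()
        return out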

Monitoring Stack

  • Metrics: Prometheus + Grafana.
  • Logs: Loki, ELK, Datadog.
  • Tracing: OpenTelemetry, Jaeger.
  • ML-specific: Evidently, WhyLabs, Arize, Fiddler.
  • Alerting: PagerDuty, Slack webhook.

Feedback Loop

  • Prediction log → join with ground truth → live metric.
  • Capture user feedback (thumbs up/down).
  • Drift detection → auto-retrain trigger.
  • Compare candidate models with a shadow deployment.
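
The first step of the loop (prediction log, ground truth join, live metric) can be sketched as follows. This is a simplified in-memory version; the field names request_id and prediction and the function name live_accuracy are assumptions for illustration, and a real system would typically do the join in a warehouse or stream processor.

```python
def live_accuracy(prediction_log, ground_truth):
    """Join logged predictions with late-arriving labels and score them.

    prediction_log: list of dicts with 'request_id' and 'prediction'
    ground_truth:   dict mapping request_id -> true label
    Returns (accuracy, number_of_labelled_predictions).
    """
    matched = [
        (p["prediction"], ground_truth[p["request_id"]])
        for p in prediction_log
        if p["request_id"] in ground_truth  # labels often arrive late or never
    ]
    if not matched:
        return None, 0  # no labels yet, nothing to report
    correct = sum(pred == truth for pred, truth in matched)
    return correct / len(matched), len(matched)
```

Returning the number of matched rows alongside the accuracy matters: a live metric computed on a handful of labelled requests is too noisy to alert on.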

Common Mistakes

  • Tracking only accuracy: latency and business KPIs are needed too.
  • Not updating the reference dataset.
  • Alert fatigue: thresholds that are too sensitive.
  • Not checking subgroup performance: bias goes unnoticed.
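
The last mistake above is easy to guard against once predictions are joined with labels. A minimal sketch, assuming each record is a dict carrying prediction, label, and a grouping field; the field names and the function name accuracy_by_group are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key):
    """Compute accuracy separately for each value of group_key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = r[group_key]
        totals[group] += 1
        correct[group] += int(r["prediction"] == r["label"])
    return {group: correct[group] / totals[group] for group in totals}
```

An overall accuracy of 0.75 can hide one group at 1.0 and another at 0.5, which is why the per-group breakdown is worth tracking alongside the aggregate.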

Summary

At a glance

Monitoring = Operational + Data quality + Drift + Performance + Business. Prometheus/Grafana + Evidently = base stack.