Feature Stores

Reusable feature management.

Hook: A Central Library for Features

Team A computes user_avg_purchase in a notebook, and Team B builds the same feature again in a slightly different way. In production, the result is training-serving skew. A feature store addresses this problem: define a feature once, let everyone reuse it, and run the same code in training and serving.
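
To make the skew concrete, here is a minimal sketch. Only the feature name user_avg_purchase comes from the text above; the two team functions, the refund rule, and the sample amounts are hypothetical. It shows two plausible definitions disagreeing on the same data, and a single shared definition that both sides could import instead.

```python
from statistics import mean

# Hypothetical purchase amounts for one user; negative values are refunds.
purchases = [20.0, 40.0, -10.0, 30.0]

def team_a_user_avg_purchase(amounts):
    """Team A (training notebook): average over every transaction."""
    return mean(amounts)

def team_b_user_avg_purchase(amounts):
    """Team B (serving code): average over positive transactions only."""
    return mean([a for a in amounts if a > 0])

def user_avg_purchase(amounts):
    """One shared definition that both training and serving would import."""
    positive = [a for a in amounts if a > 0]
    return mean(positive) if positive else 0.0

train_value = team_a_user_avg_purchase(purchases)  # what the model was trained on
serve_value = team_b_user_avg_purchase(purchases)  # what the model sees in production
skew = serve_value - train_value                   # non-zero means skew
```

With these sample amounts the two definitions differ by 10.0 for the same user, even though both are called "average purchase".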

Why a Feature Store?

  • Reusability: features are shared across teams and models.
  • Consistency: prevents training-serving skew.
  • Point-in-time correctness: avoids leaking future data into historical training rows.
  • Low-latency online serving (Redis/DynamoDB).
  • Lineage, governance, monitoring.

Architecture

  • Offline Store: historical features (Parquet/BigQuery).
  • Online Store: low-latency lookup (Redis, DynamoDB).
  • Registry: feature definitions and metadata.
  • Materialization: syncs offline → online.
  • Serving SDK: used from model code.
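
How these components fit together can be sketched with a toy in-memory store. ToyFeatureStore and its methods are hypothetical names invented for illustration and do not belong to any real library. A dict stands in for the registry, a list for the offline store, and another dict for the online store.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ToyFeatureStore:
    """In-memory stand-in for the components listed above (not a real library)."""
    registry: dict = field(default_factory=dict)  # feature name -> metadata
    offline: list = field(default_factory=list)   # full history of feature rows
    online: dict = field(default_factory=dict)    # entity id -> latest values

    def register(self, name, dtype, description=""):
        """Registry: record a feature definition and its metadata."""
        self.registry[name] = {"dtype": dtype, "description": description}

    def ingest(self, entity_id, ts, values):
        """Offline store: append a timestamped historical row."""
        unknown = set(values) - set(self.registry)
        if unknown:
            raise KeyError(f"unregistered features: {sorted(unknown)}")
        self.offline.append({"entity_id": entity_id, "ts": ts, **values})

    def materialize(self):
        """Materialization: copy the newest offline row per entity online."""
        for row in sorted(self.offline, key=lambda r: r["ts"]):
            values = {k: v for k, v in row.items() if k not in ("entity_id", "ts")}
            self.online[row["entity_id"]] = values  # later rows overwrite earlier ones

    def get_online(self, entity_id):
        """Serving SDK: a single key lookup, or None if the entity is unknown."""
        return self.online.get(entity_id)

store = ToyFeatureStore()
store.register("purchases_30d", "int64")
store.register("avg_basket", "float32")
store.ingest(42, datetime(2024, 1, 1), {"purchases_30d": 3, "avg_basket": 25.0})
store.ingest(42, datetime(2024, 1, 8), {"purchases_30d": 5, "avg_basket": 31.5})
store.materialize()
latest = store.get_online(42)
```

The offline list keeps both rows for training, while the online dict holds only the newest row per user, which is what keeps serving lookups cheap.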

Point-in-Time Join

A point-in-time join ensures that each training row sees only what was known at that moment. It blocks leakage from the future, so training simulates what production will actually see.

pit join
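-- Keep only the latest feature row at or before each event, within 7 days.
-- Assumes (user_id, event_ts) identifies an event uniquely.
SELECT event_ts, user_id, label, purchases_30d, avg_basket
FROM (
  SELECT e.event_ts, e.user_id, e.label, f.purchases_30d, f.avg_basket,
         ROW_NUMBER() OVER (PARTITION BY e.user_id, e.event_ts
                            ORDER BY f.feature_ts DESC) AS rn
  FROM events e
  LEFT JOIN user_features f
    ON  f.user_id    = e.user_id
    AND f.feature_ts <= e.event_ts
    AND f.feature_ts >  e.event_ts - INTERVAL '7 days'
) ranked
WHERE rn = 1;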
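
The same rule can be sketched in plain Python, which makes the two conditions easier to see: never look past the event timestamp, and ignore values older than the window. The function name point_in_time_join and the sample rows are hypothetical. The column names mirror the SQL above.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def point_in_time_join(events, feature_rows, ttl=timedelta(days=7)):
    """For each event, attach the latest feature row with
    event_ts - ttl < feature_ts <= event_ts, or None if there is none."""
    by_user = {}
    for row in sorted(feature_rows, key=lambda r: r["feature_ts"]):
        by_user.setdefault(row["user_id"], []).append(row)

    joined = []
    for event in events:
        history = by_user.get(event["user_id"], [])
        timestamps = [r["feature_ts"] for r in history]
        # Index of the first feature row strictly after the event.
        idx = bisect_right(timestamps, event["event_ts"])
        match = history[idx - 1] if idx > 0 else None
        if match is not None and match["feature_ts"] <= event["event_ts"] - ttl:
            match = None  # newest known value falls outside the TTL window
        value = match["purchases_30d"] if match else None
        joined.append({**event, "purchases_30d": value})
    return joined

features = [
    {"user_id": 42, "feature_ts": datetime(2024, 1, 1), "purchases_30d": 3},
    {"user_id": 42, "feature_ts": datetime(2024, 1, 10), "purchases_30d": 9},
]
events = [
    {"user_id": 42, "event_ts": datetime(2024, 1, 5), "label": 1},
    {"user_id": 42, "event_ts": datetime(2024, 1, 9), "label": 0},
    {"user_id": 42, "event_ts": datetime(2024, 1, 12), "label": 1},
]
training_rows = point_in_time_join(events, features)
```

The January 5 event gets the January 1 value and never the January 10 one, which lies in its future. The January 9 event gets None because the only earlier value is more than 7 days old.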

Popular Feature Stores

  • Feast: open source, lightweight.
  • Tecton: managed, enterprise.
  • SageMaker Feature Store: AWS.
  • Vertex AI Feature Store: GCP.
  • Databricks Feature Store: Lakehouse-integrated.
  • Hopsworks: open source, enterprise-oriented.

Code: Feast Example

feature_repo.py
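from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the key that feature rows are joined on.
user = Entity(name="user_id", join_keys=["user_id"])

# Offline source: Parquet data with one timestamped row per user snapshot.
source = FileSource(
    path="s3://lake/gold/user_features.parquet",
    timestamp_field="event_ts",
)

# Feature view: schema, source, and how long a value stays valid (TTL).
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name="purchases_30d", dtype=Int64),
        Field(name="avg_basket",    dtype=Float32),
    ],
    source=source,
    online=True,  # make this view available for online serving
)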

Training Data + Online Serving

use_feast.py
from feast import FeatureStore

fs = FeatureStore(repo_path=".")

# Training: point-in-time join against the offline store.
# event_log_df is assumed to be an existing DataFrame (user_id, event_ts, label).
training_df = fs.get_historical_features(
    entity_df=event_log_df,
    features=["user_stats:purchases_30d", "user_stats:avg_basket"],
).to_df()

# Serving: low-latency online lookup (<10 ms).
# Assumes user_stats has already been materialized into the online store.
features = fs.get_online_features(
    features=["user_stats:purchases_30d", "user_stats:avg_basket"],
    entity_rows=[{"user_id": 42}],
).to_dict()

Best Practices

  • Feature naming convention (entity__feature__window).
  • Versioning: keep changes backward-compatible.
  • Freshness monitoring.
  • Online/offline parity tests.
  • On-demand transformations for request-time features.
  • Cost: keep only hot features in the online store.
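
As one illustration of the practices above, an online/offline parity test can be sketched as a comparison of the values both stores return for the same entities. The function parity_mismatches, the tolerance, and the sample values are hypothetical. In a real setup the two dicts would be filled from the offline and online stores.

```python
import math

def parity_mismatches(offline_rows, online_rows, rel_tol=1e-6):
    """Compare feature values for the same entities in both stores.

    Returns a list of (entity_id, feature, offline_value, online_value)
    for every value that is missing online or differs beyond rel_tol.
    """
    mismatches = []
    for entity_id, offline_values in offline_rows.items():
        online_values = online_rows.get(entity_id, {})
        for feature, expected in offline_values.items():
            actual = online_values.get(feature)
            if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
                mismatches.append((entity_id, feature, expected, actual))
    return mismatches

# Hypothetical snapshots of the same two users in each store.
offline = {
    42: {"purchases_30d": 5, "avg_basket": 31.5},
    7: {"purchases_30d": 1, "avg_basket": 12.0},
}
online = {
    42: {"purchases_30d": 5, "avg_basket": 31.5000001},  # float noise, within tolerance
    7: {"purchases_30d": 2, "avg_basket": 12.0},         # genuinely different
}
problems = parity_mismatches(offline, online)
```

A relative tolerance avoids false alarms from float32 round-trips. A check like this could run on a sample of entities after each materialization.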

Common Mistakes

  • Skipping the point-in-time join: future leakage.
  • Training in pandas, serving in SQL: skew.
  • Wrong TTL: stale features.
  • Dumping every feature into the online store: costs go sky-high.
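
Stale features are easier to catch with an explicit freshness check than by inspecting model quality afterwards. The sketch below is a minimal version under stated assumptions: stale_features, the feature names, and the 7-day limit are all hypothetical, and the last-update timestamps would come from the store's own metadata in practice.

```python
from datetime import datetime, timedelta

def stale_features(last_updated, max_age, now):
    """Return the names of features whose newest value is older than max_age."""
    return sorted(
        name for name, ts in last_updated.items()
        if now - ts > max_age
    )

# Hypothetical last-update times, named with the entity__feature__window pattern.
now = datetime(2024, 1, 15, 12, 0)
last_updated = {
    "user__purchases__30d": datetime(2024, 1, 15, 6, 0),   # 6 hours old
    "user__avg_basket__30d": datetime(2024, 1, 5, 0, 0),   # more than 10 days old
}
stale = stale_features(last_updated, max_age=timedelta(days=7), now=now)
```

Passing now as an argument instead of calling datetime.now() inside the function keeps the check deterministic and easy to test.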

Summary

At a glance

Feature Store = define once, reuse everywhere. Offline + Online + Registry + PIT join = the death of training-serving skew.