Feature Stores: Machine Learning

Hook: A Central Library for Features

Team A computes user_avg_purchase in a notebook, and Team B rebuilds the same feature in a slightly different way. The result in production is training-serving skew. A feature store addresses this problem: define a feature once, let everyone reuse it, and run the same code for training and serving.

Why a Feature Store?

Reusability: features are shared across teams and models.
Consistency: prevents training-serving skew.
Point-in-time correctness: avoids leaking future data into historical training rows.
Low-latency online serving (Redis/DynamoDB).
Lineage, governance, and monitoring.
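
To make the skew problem concrete, here is a minimal sketch with made-up numbers. The two functions are hypothetical stand-ins for two teams' versions of user_avg_purchase; the refund handling is an assumed source of disagreement, not something taken from a real codebase.

```python
# Hypothetical purchase amounts for one user; None marks a refunded order.
purchases = [20.0, None, 40.0]

def user_avg_purchase_team_a(amounts):
    """Team A (training notebook): skip refunded orders entirely."""
    valid = [a for a in amounts if a is not None]
    return sum(valid) / len(valid) if valid else 0.0

def user_avg_purchase_team_b(amounts):
    """Team B (serving path): count refunded orders as 0."""
    filled = [a if a is not None else 0.0 for a in amounts]
    return sum(filled) / len(filled) if filled else 0.0

train_value = user_avg_purchase_team_a(purchases)
serve_value = user_avg_purchase_team_b(purchases)
skew = abs(train_value - serve_value)
print(f"training={train_value}, serving={serve_value}, skew={skew}")
```

The model is trained on one value and served another for the same user. With a single shared definition, both paths would call the same function.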

Architecture

Offline Store: historical feature values (Parquet/BigQuery).
Online Store: low-latency lookups (Redis, DynamoDB).
Registry: feature definitions and metadata.
Materialization: syncs feature values from the offline store to the online store.
Serving SDK: used from model code to fetch features.
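
How these pieces fit together can be sketched with a toy in-memory model. Everything here is illustrative: the dictionaries stand in for real stores, and `materialize` and `get_online_features` are hypothetical helpers written for this example, not any library's API.

```python
from datetime import datetime, timezone

# Offline store: full history, one row per (entity, timestamp).
offline_store = [
    {"user_id": 42, "feature_ts": datetime(2024, 1, 1, tzinfo=timezone.utc), "purchases_30d": 3},
    {"user_id": 42, "feature_ts": datetime(2024, 1, 8, tzinfo=timezone.utc), "purchases_30d": 5},
    {"user_id": 7,  "feature_ts": datetime(2024, 1, 5, tzinfo=timezone.utc), "purchases_30d": 1},
]

# Registry: feature definitions and metadata.
registry = {"user_stats": {"entity": "user_id", "features": ["purchases_30d"]}}

# Online store: latest value per entity, keyed for fast lookup.
online_store = {}

def materialize(view_name):
    """Copy the newest offline row per entity into the online store."""
    view = registry[view_name]
    # Ascending order, so later rows overwrite earlier ones for the same key.
    for row in sorted(offline_store, key=lambda r: r["feature_ts"]):
        key = (view_name, row[view["entity"]])
        online_store[key] = {name: row[name] for name in view["features"]}

def get_online_features(view_name, entity_id):
    """Serving-side lookup: a single key read, no history scan."""
    return online_store.get((view_name, entity_id))

materialize("user_stats")
print(get_online_features("user_stats", 42))
```

The offline store keeps every historical row for training, while the online store holds only the latest value per entity, which is what keeps serving lookups cheap.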

Point-in-Time Join

A point-in-time join keeps training data faithful to what was actually known at the moment of each event. It prevents leakage from the future, so training simulates what production would have seen.

pit join

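-- Keep only the latest feature row at or before each event (at most 7 days old); assumes (user_id, event_ts) identifies an event.
SELECT event_ts, user_id, label, purchases_30d, avg_basket
FROM (
  SELECT e.event_ts, e.user_id, e.label, f.purchases_30d, f.avg_basket,
         ROW_NUMBER() OVER (PARTITION BY e.user_id, e.event_ts
                            ORDER BY f.feature_ts DESC) AS rn
  FROM events e
  LEFT JOIN user_features f
    ON  f.user_id    = e.user_id
    AND f.feature_ts <= e.event_ts
    AND f.feature_ts >  e.event_ts - INTERVAL '7 days'
) ranked
WHERE rn = 1;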
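
The same point-in-time lookup can be sketched in plain Python, which makes the "latest row at or before the event" rule explicit. This is a minimal illustration under assumed inputs: `point_in_time_join` is a hypothetical helper, the sample rows are made up, and the 7-day `max_age` mirrors the window used in the SQL.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def point_in_time_join(events, features, max_age=timedelta(days=7)):
    """For each event, attach the latest feature row with
    event_ts - max_age < feature_ts <= event_ts, or None if there is none."""
    # Group feature rows per user, sorted by timestamp for binary search.
    by_user = {}
    for row in sorted(features, key=lambda r: r["feature_ts"]):
        by_user.setdefault(row["user_id"], []).append(row)

    joined = []
    for event in events:
        rows = by_user.get(event["user_id"], [])
        timestamps = [r["feature_ts"] for r in rows]
        # Index of the first feature row strictly after the event.
        idx = bisect_right(timestamps, event["event_ts"])
        match = rows[idx - 1] if idx > 0 else None
        if match is not None and match["feature_ts"] <= event["event_ts"] - max_age:
            match = None  # too old: treat as missing, like an expired TTL
        joined.append({**event, "purchases_30d": match["purchases_30d"] if match else None})
    return joined

features = [
    {"user_id": 42, "feature_ts": datetime(2024, 1, 1), "purchases_30d": 3},
    {"user_id": 42, "feature_ts": datetime(2024, 1, 10), "purchases_30d": 9},
]
events = [
    {"user_id": 42, "event_ts": datetime(2024, 1, 5), "label": 1},   # sees the Jan 1 row
    {"user_id": 42, "event_ts": datetime(2024, 1, 20), "label": 0},  # Jan 10 row is 10 days old
]
result = point_in_time_join(events, features)
print([r["purchases_30d"] for r in result])
```

The Jan 5 event never sees the Jan 10 feature value, which is the leakage a naive "latest value" join would introduce.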

Popular Feature Stores

Feast: open source, lightweight.
Tecton: managed, enterprise.
SageMaker Feature Store: AWS.
Vertex AI Feature Store: GCP.
Databricks Feature Store: Lakehouse-integrated.
Hopsworks: open source, enterprise.

Code: Feast Example

feature_repo.py

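from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the key that feature rows are joined on
user = Entity(name="user_id", join_keys=["user_id"])

# Offline source: Parquet file holding historical feature rows
source = FileSource(
    path="s3://lake/gold/user_features.parquet",
    timestamp_field="event_ts",  # when each feature row became valid
)

# Feature view: groups related features for the user entity
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=7),  # ignore feature rows older than 7 days
    schema=[
        Field(name="purchases_30d", dtype=Int64),
        Field(name="avg_basket",    dtype=Float32),
    ],
    source=source,
    online=True,  # make this view available for online serving
)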

Training Data + Online Serving

use_feast.py

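import pandas as pd
from feast import FeatureStore

fs = FeatureStore(repo_path=".")

# Illustrative entity dataframe: one row per labeled event.
# Feast looks for a timestamp column named "event_timestamp" by default.
event_log_df = pd.DataFrame({
    "user_id": [42],
    "event_timestamp": pd.to_datetime(["2024-01-15"], utc=True),
    "label": [1],
})

# Training: point-in-time join against the offline store
training_df = fs.get_historical_features(
    entity_df=event_log_df,
    features=["user_stats:purchases_30d", "user_stats:avg_basket"],
).to_df()

# Serving: online lookup (needs a prior `feast materialize` run;
# latency depends on the online store, typically a few milliseconds)
features = fs.get_online_features(
    features=["user_stats:purchases_30d", "user_stats:avg_basket"],
    entity_rows=[{"user_id": 42}],
).to_dict()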

Best Practices

Feature naming convention (entity__feature__window).
Versioning: keep changes backward-compatible.
Freshness monitoring.
Online/offline parity tests.
On-demand transformations for request-time features.
Cost: keep only hot features in the online store.
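
Two of these practices, freshness monitoring and online/offline parity tests, can be sketched as small checks. The helper names, the sample values, and the staleness threshold are assumptions made for illustration; a real setup would read both sides from the actual stores.

```python
import math
from datetime import datetime, timedelta, timezone

def is_fresh(last_updated, now, max_staleness):
    """True if the feature was refreshed within the allowed staleness window."""
    return now - last_updated <= max_staleness

def parity_mismatches(offline_row, online_row, rel_tol=1e-6):
    """Return names of features whose offline and online values disagree."""
    mismatched = []
    for name, offline_value in offline_row.items():
        online_value = online_row.get(name)
        if online_value is None or not math.isclose(offline_value, online_value, rel_tol=rel_tol):
            mismatched.append(name)
    return mismatched

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
fresh = is_fresh(datetime(2024, 1, 9, tzinfo=timezone.utc), now, timedelta(days=2))
mismatches = parity_mismatches(
    {"user__purchases__30d": 5, "user__avg_basket__30d": 31.5},  # offline values
    {"user__purchases__30d": 5, "user__avg_basket__30d": 29.0},  # online values
)
print(fresh, mismatches)
```

The feature names follow the entity__feature__window convention, so a mismatch report already says which entity and window are affected.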

Common Mistakes

Skipping the point-in-time join: future leakage.
Training in pandas, serving in SQL: skew.
Wrong TTL: stale features.
Dumping every feature into the online store: costs go sky-high.

Summary

At a Glance

Feature Store = define once, reuse everywhere. Offline + Online + Registry + PIT join = the end of training-serving skew.