Hook: A Central Library for Features
Team A computes user_avg_purchase in a notebook, and Team B builds the same feature again in a slightly different way. The result in production is training-serving skew. A feature store addresses this problem: define a feature once, let everyone reuse it, and run the same code for training and serving.
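To make the skew concrete, here is a minimal sketch of how two teams could end up with different values for the same feature name. The order amounts and both function bodies are hypothetical; the source only says the two versions differ slightly.

```python
from statistics import mean

# Hypothetical order amounts for one user; 0.0 marks a refunded order.
orders = [20.0, 0.0, 40.0]


def user_avg_purchase_team_a(amounts):
    """Team A's notebook version: average over every order."""
    return mean(amounts)


def user_avg_purchase_team_b(amounts):
    """Team B's version: drops zero-amount orders before averaging."""
    paid = [a for a in amounts if a > 0]
    return mean(paid) if paid else 0.0


print(user_avg_purchase_team_a(orders))  # 20.0
print(user_avg_purchase_team_b(orders))  # 30.0
```

A model trained on one definition and served with the other sees a different input distribution, even though both columns are called user_avg_purchase.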
Why a Feature Store?
- Reusability: features are shared across teams and models.
- Consistency: eliminates training-serving skew.
- Point-in-time correctness: prevents future data from leaking into historical training rows.
- Low-latency online serving (Redis/DynamoDB).
- Lineage, governance, monitoring.
Architecture
- Offline Store: historical feature values (Parquet/BigQuery).
- Online Store: low-latency lookup (Redis, DynamoDB).
- Registry: feature definitions plus metadata.
- Materialization: syncs offline data to the online store.
- Serving SDK: used from model code.
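The components above can be sketched as a toy in-memory class. This is only an illustration of how the pieces relate, not how any real feature store is implemented; ToyFeatureStore and all of its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    registry: dict = field(default_factory=dict)  # feature name -> metadata
    offline: list = field(default_factory=list)   # full history of feature rows
    online: dict = field(default_factory=dict)    # user_id -> latest feature values

    def register(self, name, dtype, description):
        """Registry: record a feature definition and its metadata."""
        self.registry[name] = {"dtype": dtype, "description": description}

    def ingest(self, user_id, ts, **values):
        """Offline store: append a timestamped row of registered features."""
        unknown = set(values) - set(self.registry)
        if unknown:
            raise KeyError(f"unregistered features: {sorted(unknown)}")
        self.offline.append({"user_id": user_id, "ts": ts, **values})

    def materialize(self):
        """Materialization: copy the newest offline row per user to the online store."""
        for row in sorted(self.offline, key=lambda r: r["ts"]):
            self.online[row["user_id"]] = {
                k: v for k, v in row.items() if k in self.registry
            }

    def get_online(self, user_id):
        """Serving: key-value lookup of the latest values."""
        return self.online.get(user_id)


store = ToyFeatureStore()
store.register("purchases_30d", "int64", "purchases in the last 30 days")
store.ingest(42, ts=1, purchases_30d=3)
store.ingest(42, ts=2, purchases_30d=5)
store.materialize()
print(store.get_online(42))  # {'purchases_30d': 5}
```

The offline list keeps both rows for training, while the online dict holds only the latest value per user.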
Point-in-Time Join
A point-in-time join ensures that each training row contains only what was known at that moment. It blocks future leakage, so training simulates what production would actually have seen.
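As a rough illustration, the same idea can be sketched in pandas with merge_asof, which picks the latest feature row at or before each event. The data is made up, and the 7-day tolerance mirrors the lookback window used in this section (merge_asof treats the tolerance bound as inclusive).

```python
import pandas as pd

# Hypothetical labelled events and feature snapshots.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-10", "2024-01-20", "2024-01-10"]),
    "label": [1, 0, 1],
})
user_features = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "feature_ts": pd.to_datetime(
        ["2024-01-08", "2024-01-12", "2024-01-19", "2024-01-01"]
    ),
    "purchases_30d": [3, 5, 7, 9],
})

# Both frames must be sorted on their time keys for merge_asof.
training = pd.merge_asof(
    events.sort_values("event_ts"),
    user_features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",                    # match rows for the same user only
    direction="backward",            # only feature rows at or before the event
    tolerance=pd.Timedelta(days=7),  # ignore snapshots older than 7 days
)
print(training[["user_id", "event_ts", "purchases_30d"]])
```

User 1's event on 2024-01-10 receives the 2024-01-08 snapshot (3) rather than the later 2024-01-12 one, and user 2 gets no match because the only snapshot is 9 days old.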
pit join
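-- Keep only the latest feature row per event; assumes (user_id, event_ts) identifies an event.
SELECT event_ts, user_id, label, purchases_30d, avg_basket
FROM (
  SELECT e.event_ts, e.user_id, e.label, f.purchases_30d, f.avg_basket,
         ROW_NUMBER() OVER (PARTITION BY e.user_id, e.event_ts
                            ORDER BY f.feature_ts DESC) AS rn
  FROM events e
  LEFT JOIN user_features f
    ON f.user_id = e.user_id
   AND f.feature_ts <= e.event_ts                       -- nothing from the future
   AND f.feature_ts > e.event_ts - INTERVAL '7 days'    -- 7-day lookback
) ranked
WHERE rn = 1
Popular Feature Stores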
- Feast: open source, lightweight.
- Tecton: managed, enterprise.
- SageMaker Feature Store: AWS.
- Vertex AI Feature Store: GCP.
- Databricks Feature Store: Lakehouse integrated.
- Hopsworks: open source, enterprise.
Code: Feast Example
feature_repo.py
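from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the key that feature rows are joined on.
user = Entity(name="user_id", join_keys=["user_id"])

# Offline source: a Parquet file with one timestamped row per user snapshot.
source = FileSource(
    path="s3://lake/gold/user_features.parquet",
    timestamp_field="event_ts",
)

# Feature view: the schema, source, and TTL for a group of user features.
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=7),  # feature rows older than 7 days are not used
    schema=[
        Field(name="purchases_30d", dtype=Int64),
        Field(name="avg_basket", dtype=Float32),
    ],
    source=source,
    online=True,  # make this view available for online serving
)
Training Data + Online Serving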
use_feast.py
from feast import FeatureStore

fs = FeatureStore(repo_path=".")

# Training: point-in-time join against the offline store.
# event_log_df is a pandas DataFrame prepared elsewhere
# with the columns user_id, event_ts, label.
training_df = fs.get_historical_features(
    entity_df=event_log_df,
    features=["user_stats:purchases_30d", "user_stats:avg_basket"],
).to_df()

# Serving: online lookup (<10 ms). This assumes the features have
# already been materialized to the online store.
features = fs.get_online_features(
    features=["user_stats:purchases_30d", "user_stats:avg_basket"],
    entity_rows=[{"user_id": 42}],
).to_dict()
Best Practices
- Feature naming convention (entity__feature__window).
- Versioning: keep changes backward-compatible.
- Freshness monitoring.
- Online/offline parity tests.
- On-demand transformations for request-time features.
- Cost: keep only hot features in the online store.
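Two of the practices above, the naming convention and freshness monitoring, can be checked with a few lines of code. This is a minimal sketch under assumptions: the exact name rules (lowercase parts, a window such as 30d or 12h) and both helper functions are hypothetical, since the source only gives the entity__feature__window pattern.

```python
import re
from datetime import datetime, timedelta, timezone


def is_valid_feature_name(name):
    """Check the assumed entity__feature__window convention."""
    parts = name.split("__")
    if len(parts) != 3:
        return False
    entity, feature, window = parts
    return (
        re.fullmatch(r"[a-z][a-z0-9]*", entity) is not None
        and re.fullmatch(r"[a-z][a-z0-9]*(_[a-z0-9]+)*", feature) is not None
        and re.fullmatch(r"[0-9]+[hd]", window) is not None  # e.g. 12h, 30d
    )


def is_fresh(last_materialized, max_age, now=None):
    """Return True if the last materialization is within max_age of now."""
    now = now or datetime.now(timezone.utc)
    return now - last_materialized <= max_age


print(is_valid_feature_name("user__purchases__30d"))  # True
print(is_valid_feature_name("purchases_30d"))         # False

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
last_run = datetime(2024, 1, 9, 12, tzinfo=timezone.utc)
print(is_fresh(last_run, timedelta(hours=6), now=now))  # False
```

In practice a check like is_fresh would run on a schedule and raise an alert, and the name check would run in CI when feature definitions change.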
Common Mistakes
- Skipping the point-in-time join: causes future leakage.
- Computing features in pandas for training but in SQL for serving: causes skew.
- Wrong TTL: stale features.
- Dumping every feature into the online store: costs skyrocket.
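The TTL mistake is easy to see in a small sketch. The function below is a hypothetical stand-in for a feature lookup, not the logic of any particular feature store: it returns the newest row at or before the event and rejects it if it is older than the TTL.

```python
from datetime import datetime, timedelta


def lookup_with_ttl(rows, event_ts, ttl):
    """Return the newest row at or before event_ts, or None if older than ttl."""
    candidates = [r for r in rows if r["feature_ts"] <= event_ts]
    if not candidates:
        return None
    latest = max(candidates, key=lambda r: r["feature_ts"])
    return latest if event_ts - latest["feature_ts"] <= ttl else None


# Hypothetical data: the only snapshot is 9 days older than the event.
rows = [{"feature_ts": datetime(2024, 1, 1), "purchases_30d": 9}]
event_ts = datetime(2024, 1, 10)

print(lookup_with_ttl(rows, event_ts, timedelta(days=7)))   # None
print(lookup_with_ttl(rows, event_ts, timedelta(days=30)))  # the 9-day-old row
```

With a 30-day TTL the model is quietly fed a 9-day-old value, while a 7-day TTL surfaces the gap as a missing feature. Which behavior is right depends on how quickly the feature changes.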
Summary
At a Glance
Feature Store = define once, reuse everywhere. Offline store + online store + registry + PIT join = the end of training-serving skew.