Hook — Raw data থেকে Insight
ML model এর খাবার = data। কিন্তু সেই data আসে অনেক জায়গা থেকে — DB, API, log, CSV, Kafka। ETL (Extract-Transform-Load) সেই data কে clean, transform করে warehouse/lake এ পৌঁছে দেয়।
ETL vs ELT
- ETL — Transform → Load (traditional, on-prem)।
- ELT — Load raw → Transform inside warehouse (modern cloud, Snowflake/BigQuery/dbt)।
- Streaming ETL — real-time (Kafka, Flink)।
ETL এর Steps
- Extract — source থেকে data টানা (DB query, API, file)।
- Transform — clean, enrich, join, aggregate।
- Load — destination এ লেখা (warehouse, lake, feature store)।
- Validate — schema, row count, null check।
- Schedule — Airflow/Prefect/Dagster দিয়ে।
Code — Airflow DAG
dags/sales_etl.py
from airflow import DAG
from airflow.decorators import task
from datetime import datetime
import pandas as pd
with DAG("sales_etl", start_date=datetime(2025, 1, 1),
schedule="@daily", catchup=False) as dag:
@task
def extract():
return pd.read_csv("s3://raw/sales/{{ ds }}.csv").to_dict()
@task
def transform(raw):
df = pd.DataFrame(raw)
df["revenue"] = df["price"] * df["qty"]
df = df.dropna(subset=["customer_id"])
return df.to_dict()
@task
def load(clean):
pd.DataFrame(clean).to_parquet("s3://warehouse/sales/{{ ds }}.parquet")
load(transform(extract()))Code — dbt (SQL Transformation)
models/marts/daily_sales.sql
{{ config(materialized='incremental', unique_key='date') }}
select
date_trunc('day', created_at) as date,
sum(price * qty) as revenue,
count(distinct customer_id) as customers,
count(*) as orders
from {{ ref('stg_orders') }}
{% if is_incremental() %}
where created_at > (select max(date) from {{ this }})
{% endif %}
group by 1Streaming ETL
- Kafka — event bus।
- Flink / Spark Streaming — stateful processing।
- Materialize / RisingWave — streaming SQL।
- Use case — fraud detection, real-time feature।
Modern Stack
- Orchestration — Airflow, Prefect, Dagster, Mage।
- Ingestion — Fivetran, Airbyte, Stitch।
- Transformation — dbt, SQLMesh।
- Quality — Great Expectations, dbt tests, Soda।
- Catalog — DataHub, OpenMetadata।
Best Practices
- Idempotent task — পুনরায় চালালে একই result।
- Schema versioning।
- Partition by date — pruning ও replay সহজ।
- Late-arriving data handle (watermark)।
- Lineage tracking।
- Backfill এর strategy ঠিক রাখো।
Summary
এক নজরে
ETL/ELT = data কে ML-ready করার pipeline। Airflow + dbt + Parquet + warehouse = আজকের modern stack।