📚 সমস্ত অধ্যায় দেখুন
অধ্যায়/ফেজ 8 · Phase 8 · Data Engineering
8.1২৫ মিনিট পড়া48 / 68

ETL Pipelines

ETL

Extract, Transform, Load প্রক্রিয়া।

Hook — Raw data থেকে Insight

ML model এর খাবার = data। কিন্তু সেই data আসে অনেক জায়গা থেকে — DB, API, log, CSV, Kafka। ETL (Extract-Transform-Load) সেই data কে clean, transform করে warehouse/lake এ পৌঁছে দেয়।

ETL vs ELT

  • ETL — Transform → Load (traditional, on-prem)।
  • ELT — Load raw → Transform inside warehouse (modern cloud, Snowflake/BigQuery/dbt)।
  • Streaming ETL — real-time (Kafka, Flink)।

ETL এর Steps

  • Extract — source থেকে data টানা (DB query, API, file)।
  • Transform — clean, enrich, join, aggregate।
  • Load — destination এ লেখা (warehouse, lake, feature store)।
  • Validate — schema, row count, null check।
  • Schedule — Airflow/Prefect/Dagster দিয়ে।

Code — Airflow DAG

dags/sales_etl.py
from airflow import DAG
from airflow.decorators import task
from datetime import datetime
import pandas as pd

with DAG("sales_etl", start_date=datetime(2025, 1, 1),
         schedule="@daily", catchup=False) as dag:

    @task
    def extract():
        return pd.read_csv("s3://raw/sales/{{ ds }}.csv").to_dict()

    @task
    def transform(raw):
        df = pd.DataFrame(raw)
        df["revenue"] = df["price"] * df["qty"]
        df = df.dropna(subset=["customer_id"])
        return df.to_dict()

    @task
    def load(clean):
        pd.DataFrame(clean).to_parquet("s3://warehouse/sales/{{ ds }}.parquet")

    load(transform(extract()))

Code — dbt (SQL Transformation)

models/marts/daily_sales.sql
{{ config(materialized='incremental', unique_key='date') }}

select
    date_trunc('day', created_at) as date,
    sum(price * qty)              as revenue,
    count(distinct customer_id)   as customers,
    count(*)                      as orders
from {{ ref('stg_orders') }}
{% if is_incremental() %}
where created_at > (select max(date) from {{ this }})
{% endif %}
group by 1

Streaming ETL

  • Kafka — event bus।
  • Flink / Spark Streaming — stateful processing।
  • Materialize / RisingWave — streaming SQL।
  • Use case — fraud detection, real-time feature।

Modern Stack

  • Orchestration — Airflow, Prefect, Dagster, Mage।
  • Ingestion — Fivetran, Airbyte, Stitch।
  • Transformation — dbt, SQLMesh।
  • Quality — Great Expectations, dbt tests, Soda।
  • Catalog — DataHub, OpenMetadata।

Best Practices

  • Idempotent task — পুনরায় চালালে একই result।
  • Schema versioning।
  • Partition by date — pruning ও replay সহজ।
  • Late-arriving data handle (watermark)।
  • Lineage tracking।
  • Backfill এর strategy ঠিক রাখো।

Summary

এক নজরে

ETL/ELT = data কে ML-ready করার pipeline। Airflow + dbt + Parquet + warehouse = আজকের modern stack।