
Data Lakes

Data Lake

Store raw data in one central place.

Hook: keep everything you get

A warehouse needs a schema before any data can be loaded. But images, video, JSON logs, and IoT readings are not all schema-friendly. A data lake takes the opposite approach: store everything raw first, then process it later, when it is actually needed.

Lake vs Warehouse vs Lakehouse

  • Warehouse: structured data, schema-on-write, fast SQL (Snowflake, BigQuery, Redshift).
  • Lake: any format, schema-on-read, cheap storage (S3, GCS, ADLS).
  • Lakehouse: the warehouse's ACID guarantees and SQL at the lake's low storage cost (Delta Lake, Iceberg, Hudi).
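
To make the schema-on-write vs schema-on-read contrast concrete, here is a minimal sketch using only the Python standard library. The field names (`user_id`, `amount`, `device`) and the helper `read_with_schema` are hypothetical, chosen for illustration; a real lake would apply the schema through an engine such as Spark or DuckDB.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (schema-on-read):
# fields may be missing, extra, or carry the wrong type.
RAW_LINES = [
    '{"user_id": 1, "amount": "10.5"}',
    '{"user_id": 2, "amount": 3, "device": "ios"}',
    '{"user_id": "3"}',
]

# The reader picks the schema at query time: column name -> (cast, default).
SCHEMA = {"user_id": (int, None), "amount": (float, 0.0)}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and apply the schema while reading."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, (cast, default) in schema.items():
            value = record.get(column)
            row[column] = cast(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
print(rows)
```

Nothing was rejected at write time. The extra `device` field and the string-typed numbers only matter once a reader decides which columns and types it wants. A warehouse would have enforced those decisions before accepting the rows.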

Lake Zones — Medallion Architecture

  • Bronze: raw, exactly as it arrived.
  • Silver: cleaned, deduplicated, typed.
  • Gold: aggregated, business-ready.
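
The three zones can be sketched as two small transformations over in-memory records. This is a toy illustration, not a production pipeline. The record fields and the functions `to_silver` and `to_gold` are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical bronze records: raw, with a duplicate and string-typed numbers.
bronze = [
    {"event_id": "e1", "user_id": "1", "day": "2025-01-01", "amount": "10.0"},
    {"event_id": "e1", "user_id": "1", "day": "2025-01-01", "amount": "10.0"},
    {"event_id": "e2", "user_id": "1", "day": "2025-01-01", "amount": "5.5"},
    {"event_id": "e3", "user_id": "2", "day": "2025-01-02", "amount": None},
]


def to_silver(records):
    """Drop unusable rows, dedupe on event_id, and cast to proper types."""
    seen, silver = set(), []
    for r in records:
        if r["amount"] is None or r["event_id"] in seen:
            continue
        seen.add(r["event_id"])
        silver.append({
            "event_id": r["event_id"],
            "user_id": int(r["user_id"]),
            "day": r["day"],
            "amount": float(r["amount"]),
        })
    return silver


def to_gold(silver_rows):
    """Aggregate to a business-ready table: total amount per user per day."""
    totals = defaultdict(float)
    for r in silver_rows:
        totals[(r["user_id"], r["day"])] += r["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {(1, '2025-01-01'): 15.5}
```

Bronze is never modified, so silver and gold can always be rebuilt from it if the cleaning rules change.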

File Format

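  • Parquet: columnar, compressed; the default choice.
  • ORC: columnar; common in the Hadoop ecosystem.
  • Avro: row-based, with the schema embedded in the file.
  • JSON / CSV: bronze zone only (slow to query).
  • Delta / Iceberg / Hudi: ACID table formats, layered on top of data files such as Parquet.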
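
Why columnar formats suit analytics can be seen in a toy comparison. The sketch below models the two layouts with plain Python structures and counts how many values each must read to sum one column. It is an illustration of the idea only: real Parquet or ORC files add compression, encoding, and per-column statistics that this ignores. The sample data is hypothetical.

```python
# Hypothetical table with three columns.
rows = [
    {"user_id": 1, "day": "2025-01-01", "amount": 10.0},
    {"user_id": 2, "day": "2025-01-01", "amount": 5.5},
    {"user_id": 1, "day": "2025-01-02", "amount": 2.0},
]

# Row-based layout (Avro-like): the fields of each record are stored together.
row_layout = [tuple(r.values()) for r in rows]

# Columnar layout (Parquet/ORC-like): the values of each column are stored together.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}


def sum_amount_row(layout):
    """Scans every field of every record just to reach `amount`."""
    values_read, total = 0, 0.0
    for record in layout:
        values_read += len(record)
        total += record[2]  # amount is the third field
    return total, values_read


def sum_amount_columnar(layout):
    """Reads only the `amount` column."""
    column = layout["amount"]
    return sum(column), len(column)


print(sum_amount_row(row_layout))          # (17.5, 9)
print(sum_amount_columnar(column_layout))  # (17.5, 3)
```

Both layouts give the same answer, but the columnar one touches a third of the values here. The gap grows with the number of columns the query does not need.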

Code — Delta Lake (PySpark)

delta_demo.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Bronze -> Silver: read the raw JSON, drop exact duplicates, append to the silver Delta table
raw = spark.read.json("s3://lake/bronze/events/")
raw.dropDuplicates().write.format("delta").mode("append").save("s3://lake/silver/events")

# Silver → Gold aggregate
silver = spark.read.format("delta").load("s3://lake/silver/events")
gold = (silver.groupBy("user_id", "day")
              .agg({"amount": "sum"})
              .withColumnRenamed("sum(amount)", "total"))
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/user_daily")

# Time travel
old = spark.read.format("delta").option("versionAsOf", 5).load("s3://lake/silver/events")
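
Time travel works because a table format keeps a log of commits instead of editing data in place. The following is a toy model of that idea in plain Python, not Delta Lake's actual implementation. The class `ToyVersionedTable` and the file names are hypothetical.

```python
class ToyVersionedTable:
    """Toy commit log: each commit records which data files were added or removed."""

    def __init__(self):
        self.log = []  # one entry per commit

    def commit(self, add=(), remove=()):
        """Append a commit and return its version number."""
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1

    def files_at(self, version=None):
        """Replay the log up to `version` to find the files visible at that point."""
        if version is None:
            version = len(self.log) - 1
        files = set()
        for entry in self.log[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return sorted(files)


table = ToyVersionedTable()
table.commit(add=["part-0.parquet"])                      # version 0: append
table.commit(add=["part-1.parquet"])                      # version 1: append
table.commit(add=["part-2.parquet"],
             remove=["part-0.parquet", "part-1.parquet"])  # version 2: overwrite

print(table.files_at(1))  # ['part-0.parquet', 'part-1.parquet']
print(table.files_at())   # ['part-2.parquet']
```

Reading an old version is just replaying fewer log entries, which is why it only works while the removed files still exist in storage.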

Code — DuckDB + Parquet (Light Lake)

duckdb_lake.py
import duckdb
con = duckdb.connect()
# Query Parquet files on S3 directly (httpfs adds S3 support; S3 credentials must be configured)
con.execute("INSTALL httpfs; LOAD httpfs;")
df = con.execute("""
    SELECT user_id, sum(amount) AS total
    FROM read_parquet('s3://lake/silver/events/*.parquet')
    WHERE day >= '2025-01-01'
    GROUP BY 1 ORDER BY 2 DESC LIMIT 10
""").df()
print(df)

Governance

  • Catalog: AWS Glue, Unity Catalog, Nessie.
  • Access control: IAM, Lake Formation.
  • PII masking and row-level security.
  • Retention policy and storage lifecycle (S3 → Glacier).
  • Lineage: OpenLineage.
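
PII masking and row-level security can be sketched in a few lines. In practice these rules are enforced by the catalog or query engine (for example Lake Formation or Unity Catalog), not by application code. The policy tables, role names, the salt, and the functions `mask` and `read_as` below are all hypothetical.

```python
import hashlib

# Hypothetical policy: which columns hold PII, and which regions each role may read.
PII_COLUMNS = {"email"}
ROLE_REGIONS = {"analyst_eu": {"EU"}, "admin": {"EU", "US"}}


def mask(value, salt="demo-salt"):
    """Replace a PII value with a stable token derived from a salted one-way hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def read_as(role, rows):
    """Filter rows by the role's allowed regions, then mask PII for non-admin roles."""
    allowed = ROLE_REGIONS.get(role, set())
    result = []
    for row in rows:
        if row["region"] not in allowed:
            continue  # row-level security: this role may not see the row
        if role != "admin":
            row = {k: mask(v) if k in PII_COLUMNS else v for k, v in row.items()}
        result.append(row)
    return result


events = [
    {"email": "a@example.com", "region": "EU", "amount": 10.0},
    {"email": "b@example.com", "region": "US", "amount": 5.5},
]

print(read_as("analyst_eu", events))  # one EU row, email replaced by a token
print(read_as("admin", events))       # both rows, unmasked
```

Because the token is stable, analysts can still join and count by user without seeing the underlying email address.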

Common Mistakes — ‘Data Swamp’

Lake → Swamp

Without a catalog and governance, a lake turns into a ‘data swamp’: nobody knows what is in it or where it came from, and nobody trusts it.

  • Tiny-file problem: many small files hurt performance (compaction is needed).
  • Not planning for schema evolution.
  • Choosing a poor partition column (high cardinality).
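
Two of these mistakes are easy to check for numerically. The sketch below plans a compaction by grouping small files into larger batches, and counts how many partitions a candidate column would create. The 128 MB target, the function names, and the sample data are assumptions made for illustration; table formats ship their own compaction commands.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of at most roughly target_mb each."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


def partition_count(rows, column):
    """Number of distinct values, and so directories, a partition column would create."""
    return len({row[column] for row in rows})


# 300 hypothetical 1 MB files collapse into 3 output files.
batches = plan_compaction([1] * 300)
print(len(batches), [sum(b) for b in batches])  # 3 [128, 128, 44]

# Hypothetical events: 1000 distinct users spread over 2 days.
sample = [{"user_id": i, "day": f"2025-01-0{1 + i % 2}"} for i in range(1000)]
print(partition_count(sample, "user_id"))  # 1000
print(partition_count(sample, "day"))      # 2
```

Partitioning this sample by `user_id` would create 1000 directories holding one row each, which recreates the tiny-file problem. Partitioning by `day` keeps the count low.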

Summary

At a glance

Data Lake = cheap, flexible storage. Medallion (Bronze/Silver/Gold) + Parquet + Delta/Iceberg + catalog = production lakehouse.