Hook: keep everything you get
A warehouse needs a schema before it can store data. But images, video, JSON logs, and IoT streams do not all fit neatly into a schema. The data lake approach is: store everything raw first, then process it later when you actually need it.
Lake vs Warehouse vs Lakehouse
- Warehouse: structured data, schema-on-write, fast SQL (Snowflake, BigQuery, Redshift).
- Lake: any format, schema-on-read, cheap object storage (S3, GCS, ADLS).
- Lakehouse: warehouse-style ACID transactions and SQL at lake storage cost (Delta, Iceberg, Hudi).
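The difference between schema-on-write and schema-on-read can be sketched in plain Python. This is a toy illustration only: `EVENT_SCHEMA`, `warehouse_write`, `lake_write`, and `lake_read` are hypothetical names invented here, and the in-memory lists stand in for real storage.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
EVENT_SCHEMA = {"user_id": int, "amount": float}


def validate(record: dict) -> dict:
    """Return the record coerced to EVENT_SCHEMA, or raise ValueError."""
    try:
        return {name: typ(record[name]) for name, typ in EVENT_SCHEMA.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record does not match schema: {record!r}") from exc


def warehouse_write(table: list, record: dict) -> None:
    """Schema-on-write: reject bad records before they are stored."""
    table.append(validate(record))


def lake_write(store: list, raw_line: str) -> None:
    """Schema-on-read: store the raw text untouched, whatever it contains."""
    store.append(raw_line)


def lake_read(store: list) -> list:
    """Apply the schema only at query time, skipping rows that do not fit."""
    rows = []
    for line in store:
        try:
            rows.append(validate(json.loads(line)))
        except ValueError:
            continue  # malformed rows stay in the lake but are ignored here
    return rows
```

The warehouse path fails fast at ingestion, while the lake path accepts anything and defers the cost of validation to each reader.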
Lake Zones: Medallion Architecture
- Bronze: raw data, stored exactly as it arrived.
- Silver: cleaned, deduplicated, typed.
- Gold: aggregated, business-ready.
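The three zones can be sketched as two small transformations over in-memory records. The sample events, the `event_id` deduplication key, and the function names `to_silver` and `to_gold` are assumptions made for this illustration; a real pipeline would run the same steps in an engine such as Spark.

```python
from collections import defaultdict

# Hypothetical raw events, as they might land in the Bronze zone (all strings).
bronze = [
    {"event_id": "a1", "user_id": "7", "day": "2025-01-01", "amount": "10.0"},
    {"event_id": "a1", "user_id": "7", "day": "2025-01-01", "amount": "10.0"},  # duplicate
    {"event_id": "a2", "user_id": "7", "day": "2025-01-01", "amount": "2.5"},
    {"event_id": "a3", "user_id": "8", "day": "2025-01-02", "amount": "oops"},  # bad value
]


def to_silver(rows):
    """Bronze -> Silver: dedupe on event_id, cast types, drop unparseable rows."""
    seen, out = set(), []
    for row in rows:
        if row["event_id"] in seen:
            continue
        try:
            typed = {
                "event_id": row["event_id"],
                "user_id": int(row["user_id"]),
                "day": row["day"],
                "amount": float(row["amount"]),
            }
        except ValueError:
            continue  # this sketch simply discards rows that fail typing
        seen.add(row["event_id"])
        out.append(typed)
    return out


def to_gold(rows):
    """Silver -> Gold: total amount per (user_id, day)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["user_id"], row["day"])] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is never modified, so Silver and Gold can always be rebuilt from it if the cleaning rules change.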
File Format
- Parquet: columnar, compressed, the default choice.
- ORC: columnar, common in the Hadoop ecosystem.
- Avro: row-based, schema embedded in the file.
- JSON / CSV: bronze zone only (slow to scan).
- Delta / Iceberg / Hudi: ACID table formats layered on top of data files.
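Why a columnar layout helps analytics can be shown with a toy example: pivoting rows into columns lets a query touch only the column it needs, and repeated values within a column compress well. This is a rough sketch and not Parquet's actual on-disk layout; `to_columns` and `run_length_encode` are hypothetical helpers, and the sample rows are made up.

```python
# Row-oriented records, similar in spirit to a row-based format such as Avro.
rows = [
    {"user_id": 1, "day": "2025-01-01", "amount": 10.0},
    {"user_id": 2, "day": "2025-01-01", "amount": 2.5},
    {"user_id": 1, "day": "2025-01-02", "amount": 4.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one column reads only that column's values.
total_amount = sum(columns["amount"])

# Sorted or low-cardinality columns shrink well under run-length encoding.
encoded_day = run_length_encode(columns["day"])
```

In a row-based layout the same sum would have to read every field of every record, which is one reason JSON and CSV are kept to the bronze zone.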
Code: Delta Lake (PySpark)
delta_demo.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Enable Delta Lake on the Spark session (the Delta Lake package must be available to Spark)
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
# Bronze → Silver: read raw JSON and append it to the silver Delta table
# (cleaning, deduplication and typing are omitted here for brevity)
raw = spark.read.json("s3://lake/bronze/events/")
raw.write.format("delta").mode("append").save("s3://lake/silver/events")
# Silver → Gold: total amount per user per day
silver = spark.read.format("delta").load("s3://lake/silver/events")
gold = (
    silver.groupBy("user_id", "day")
    .agg(F.sum("amount").alias("total"))
)
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/user_daily")
# Time travel: read the silver table as of version 5 (that version must exist in the table history)
old = spark.read.format("delta").option("versionAsOf", 5).load("s3://lake/silver/events")
Code: DuckDB + Parquet (Light Lake)
duckdb_lake.py
import duckdb
con = duckdb.connect()
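# Load the httpfs extension so DuckDB can query Parquet files on S3 directly
# (S3 credentials and region must be configured separately)
con.execute("INSTALL httpfs; LOAD httpfs;")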
df = con.execute("""
SELECT user_id, sum(amount) AS total
FROM read_parquet('s3://lake/silver/events/*.parquet')
WHERE day >= '2025-01-01'
GROUP BY 1 ORDER BY 2 DESC LIMIT 10
""").df()
print(df)
Governance
- Catalog: Glue, Unity Catalog, Nessie.
- Access control: IAM, Lake Formation.
- PII masking and row-level security.
- Retention policy and storage lifecycle rules (S3 → Glacier).
- Lineage: OpenLineage.
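As a rough illustration of PII masking combined with row-level security, the sketch below filters rows by region and hashes sensitive columns for non-admin readers. Everything here is hypothetical: the `PII_COLUMNS` set, the `role` and `region` parameters, and the sample rows are assumptions, and in practice these policies are enforced by the catalog or access layer rather than in application code. An unsalted hash is pseudonymization only, not strong anonymization.

```python
import hashlib

# Hypothetical policy: which columns count as PII.
PII_COLUMNS = {"email"}


def mask_value(value: str) -> str:
    """Replace a sensitive value with a short, stable hash token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_policy(rows, role, region):
    """Return the rows this reader may see, with PII masked for non-admins.

    Row-level security: non-admins only see rows from their own region.
    Column masking: non-admins see hashed tokens instead of PII values.
    """
    if role == "admin":
        return [dict(row) for row in rows]
    visible = [row for row in rows if row["region"] == region]
    return [
        {key: mask_value(val) if key in PII_COLUMNS else val for key, val in row.items()}
        for row in visible
    ]


# Made-up sample data for illustration.
customers = [
    {"user": "a", "email": "a@example.com", "region": "eu"},
    {"user": "b", "email": "b@example.com", "region": "us"},
]

analyst_view = apply_policy(customers, role="analyst", region="eu")
admin_view = apply_policy(customers, role="admin", region="eu")
```

Because the hash is stable, analysts can still join or count by the masked column without seeing the underlying value.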
Common Mistakes: the ‘Data Swamp’
Lake → Swamp
Without a catalog and governance, a lake turns into a ‘data swamp’: nobody knows what it contains or where the data came from, and nobody trusts it.
- Tiny file problem: many small files hurt query performance (compaction is needed).
- Not planning for schema evolution.
- Choosing a poor partition column (high cardinality).
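Two of the mistakes above can be checked with simple arithmetic before they become a problem. The sketch below plans a compaction by grouping small files into batches up to a target size, and counts distinct values of a candidate partition column. The function names and the 128 MB / 32 MB thresholds are assumptions chosen for illustration, not settings from any particular engine.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_mb=32):
    """Group files smaller than small_mb into batches of at most target_mb.

    Each returned batch is a list of file sizes that could be rewritten
    as a single larger file. Batches with only one file are dropped,
    since rewriting them gains nothing.
    """
    small = sorted((size for size in file_sizes_mb if size < small_mb), reverse=True)
    batches, current, current_size = [], [], 0
    for size in small:
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return [batch for batch in batches if len(batch) > 1]


def partition_cardinality(rows, column):
    """Count distinct values of a column; each one becomes a partition directory."""
    return len({row[column] for row in rows})


# Ten 1 MB files and one large file; a tiny target makes the batching visible.
plan = plan_compaction([1] * 10 + [200], target_mb=4)

# Made-up events: partitioning by user_id creates far more directories than by day.
events = [{"user_id": i, "day": "2025-01-01"} for i in range(1000)]
by_user = partition_cardinality(events, "user_id")
by_day = partition_cardinality(events, "day")
```

A high-cardinality partition column such as `user_id` produces one directory per value, which in turn creates the tiny files that compaction then has to clean up.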
Summary
At a glance
Data Lake = cheap, flexible storage. Medallion (Bronze/Silver/Gold) + Parquet + Delta/Iceberg + catalog = production lakehouse.