📚 সমস্ত অধ্যায় দেখুন
অধ্যায়/ফেজ 8 · Phase 8 · Data Engineering
8.3২৫ মিনিট পড়া50 / 68

Data Warehouses

Data Warehouse

Analytics এর জন্য structured store।

Hook — Analytics এর Engine

Production database (OLTP) এ অসংখ্য ছোট transaction হয় — analytics query চালালে slow। Data Warehouse (OLAP) সেই দ্বন্দ্ব সমাধান করে — billion row aggregate seconds এ।

OLTP vs OLAP

  • OLTP — row-based, short transaction (Postgres, MySQL)।
  • OLAP — column-based, aggregate scan (Snowflake, BigQuery, Redshift, ClickHouse)।
  • Columnar = একই column এর সব value পাশাপাশি = compression + scan ভালো।

Data Modeling

  • Star Schema — center এ fact, পাশে dimension।
  • Snowflake Schema — dimension আবার normalize।
  • Data Vault — flexible enterprise model।
  • One Big Table (OBT) — columnar এ দ্রুত, dbt friendly।

Major Warehouses

  • Snowflake — multi-cloud, separation of storage/compute।
  • BigQuery — serverless, pay-per-query।
  • Redshift — AWS native।
  • Databricks SQL — Lakehouse।
  • ClickHouse — open source, blazing fast।
  • Postgres + DuckDB — small/medium scale।

Code — Star Schema SQL

schema.sql
-- Dimension tables
CREATE TABLE dim_customer (
    customer_id INT PRIMARY KEY,
    name        TEXT,
    country     TEXT,
    signup_date DATE
);

CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    name       TEXT,
    category   TEXT,
    price      DECIMAL
);

CREATE TABLE dim_date (
    date_id   DATE PRIMARY KEY,
    year      INT, month INT, day INT,
    weekday   TEXT, is_weekend BOOLEAN
);

-- Fact table
CREATE TABLE fact_sales (
    sale_id     BIGINT PRIMARY KEY,
    date_id     DATE  REFERENCES dim_date,
    customer_id INT   REFERENCES dim_customer,
    product_id  INT   REFERENCES dim_product,
    quantity    INT,
    revenue     DECIMAL
);

-- Analytical query
SELECT d.month, p.category, sum(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d    ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
WHERE d.year = 2025
GROUP BY 1, 2
ORDER BY 1, 3 DESC;

Performance Tricks

  • Partitioning — date column দিয়ে scan কমাও।
  • Clustering / Z-ordering — co-located related row।
  • Materialized view — pre-aggregated result।
  • Approximate functions — APPROX_COUNT_DISTINCT।
  • Query result cache।
  • Result reuse / incremental model (dbt)।

Cost Control

  • SELECT * এড়াও — শুধু দরকারি column।
  • Partition filter সবসময় add করো।
  • Auto-suspend warehouse (Snowflake)।
  • Query monitoring + budget alert।
  • Workload — analyst vs ETL vs ML আলাদা warehouse।

ML Integration

  • BigQuery ML, Snowflake Cortex — SQL এ-ই model।
  • dbt + ML — feature view।
  • Reverse ETL — warehouse → operational tool (Hightouch, Census)।

Summary

এক নজরে

Warehouse = OLAP, columnar, star schema। Snowflake/BigQuery + dbt + materialized view = scalable analytics + ML feature pipeline।