Data Warehouses — মেশিন লার্নিং

Hook — Analytics এর Engine

Production database (OLTP) এ অসংখ্য ছোট transaction হয় — analytics query চালালে slow। Data Warehouse (OLAP) সেই দ্বন্দ্ব সমাধান করে — billion row aggregate seconds এ।

OLTP vs OLAP

OLTP — row-based, short transaction (Postgres, MySQL)।
OLAP — column-based, aggregate scan (Snowflake, BigQuery, Redshift, ClickHouse)।
Columnar = একই column এর সব value পাশাপাশি = compression + scan ভালো।

Data Modeling

Star Schema — center এ fact, পাশে dimension।
Snowflake Schema — dimension আবার normalize।
Data Vault — flexible enterprise model।
One Big Table (OBT) — columnar এ দ্রুত, dbt friendly।

Major Warehouses

Snowflake — multi-cloud, separation of storage/compute।
BigQuery — serverless, pay-per-query।
Redshift — AWS native।
Databricks SQL — Lakehouse।
ClickHouse — open source, blazing fast।
Postgres + DuckDB — small/medium scale।

Code — Star Schema SQL

schema.sql

-- Dimension tables
CREATE TABLE dim_customer (
    customer_id INT PRIMARY KEY,
    name        TEXT,
    country     TEXT,
    signup_date DATE
);

CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    name       TEXT,
    category   TEXT,
    price      DECIMAL
);

CREATE TABLE dim_date (
    date_id   DATE PRIMARY KEY,
    year      INT, month INT, day INT,
    weekday   TEXT, is_weekend BOOLEAN
);

-- Fact table
CREATE TABLE fact_sales (
    sale_id     BIGINT PRIMARY KEY,
    date_id     DATE  REFERENCES dim_date,
    customer_id INT   REFERENCES dim_customer,
    product_id  INT   REFERENCES dim_product,
    quantity    INT,
    revenue     DECIMAL
);

-- Analytical query
SELECT d.month, p.category, sum(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d    ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
WHERE d.year = 2025
GROUP BY 1, 2
ORDER BY 1, 3 DESC;

Performance Tricks

Partitioning — date column দিয়ে scan কমাও।
Clustering / Z-ordering — co-located related row।
Materialized view — pre-aggregated result।
Approximate functions — APPROX_COUNT_DISTINCT।
Query result cache।
Result reuse / incremental model (dbt)।

Cost Control

SELECT * এড়াও — শুধু দরকারি column।
Partition filter সবসময় add করো।
Auto-suspend warehouse (Snowflake)।
Query monitoring + budget alert।
Workload — analyst vs ETL vs ML আলাদা warehouse।

ML Integration

BigQuery ML, Snowflake Cortex — SQL এ-ই model।
dbt + ML — feature view।
Reverse ETL — warehouse → operational tool (Hightouch, Census)।

Summary

এক নজরে

Warehouse = OLAP, columnar, star schema। Snowflake/BigQuery + dbt + materialized view = scalable analytics + ML feature pipeline।