House Price Prediction: Machine Learning

Hook: Your First End-to-End Project

House price prediction is the ‘Hello World’ of ML. But a real project is more than `model.fit()`: data, EDA, feature engineering, evaluation, and deployment together make up a complete pipeline.

Problem Definition

Goal: Predict a house's price from its features.
Type: Supervised regression.
Metric: RMSE / MAE / R².
Dataset: Kaggle ‘House Prices - Advanced Regression Techniques’ or Ames Housing.
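
To make the three metrics concrete, here is a minimal NumPy sketch that computes them by hand. The prices are made-up illustrative numbers (in thousands of dollars), not values from the dataset.

```python
import numpy as np

# Hypothetical prices in thousands of USD, for illustration only.
y_true = np.array([200.0, 300.0, 400.0])
y_pred = np.array([210.0, 290.0, 430.0])

err = y_pred - y_true

mae = np.mean(np.abs(err))                      # average absolute error
rmse = np.sqrt(np.mean(err ** 2))               # penalizes large errors more than MAE
ss_res = np.sum(err ** 2)                       # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # share of variance explained

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
# prints: MAE=16.67  RMSE=19.15  R2=0.945
```

RMSE exceeds MAE here because the single 30-unit miss is squared before averaging, which is why RMSE is the stricter choice when large errors are costly.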

Project Pipeline

flow

1. Data load → 2. EDA → 3. Cleaning →
4. Feature Engineering → 5. Train/Val split →
6. Model selection → 7. Hyperparameter tuning →
8. Evaluation → 9. Save model → 10. Deploy (FastAPI)

EDA: Look at the Data

Target distribution: check whether a log transform is needed.
Missing value heatmap (seaborn).
Correlation matrix: pick the top features.
Outlier detection: boxplot / IQR.

eda.py

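import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.read_csv("train.csv")
print(df.shape)
print(df.isna().sum().sort_values(ascending=False).head(10))  # columns with the most missing values
sns.histplot(np.log1p(df["SalePrice"]))  # target on a log scale
plt.show()  # needed when this runs as a script rather than in a notebook
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
print(corr.head(15))  # numeric features most correlated with the target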
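
eda.py does not cover the outlier check from the list above. A minimal sketch of the IQR rule follows, using a small made-up series in place of a real column such as `GrLivArea`. The helper name `iqr_outliers` and the conventional 1.5 multiplier are assumptions, not part of the original script.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Made-up living areas (sq ft); the last value is deliberately extreme.
area = pd.Series([1200, 1350, 1500, 1600, 1750, 1900, 5600], name="GrLivArea")
mask = iqr_outliers(area)
print(area[mask])  # only the 5600 row is flagged
```

Whether a flagged row should be dropped, capped, or kept is a judgment call; the mask only tells you where to look.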

Feature Engineering

Numeric: log-transform skewed features.
Categorical: one-hot or target encoding.
Combined: TotalSF = 1stFlrSF + 2ndFlrSF + TotalBsmtSF.
Date: HouseAge = YrSold − YearBuilt.
Missing: median (numeric), ‘None’ (categorical).
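
The rules above can be sketched on a toy DataFrame. The column names (`1stFlrSF`, `2ndFlrSF`, `TotalBsmtSF`, `YrSold`, `YearBuilt`, `LotArea`, `GarageType`) follow the Kaggle data dictionary as I understand it; the values are made up, and the helper name `add_features` is hypothetical.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Missing values: median for numeric, the literal "None" for categorical.
    out["TotalBsmtSF"] = out["TotalBsmtSF"].fillna(out["TotalBsmtSF"].median())
    out["GarageType"] = out["GarageType"].fillna("None")
    # Combined feature: total square footage across both floors and the basement.
    out["TotalSF"] = out["1stFlrSF"] + out["2ndFlrSF"] + out["TotalBsmtSF"]
    # Date feature: age of the house at the time of sale.
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    # Numeric: log1p compresses the long right tail of a skewed feature.
    out["LotArea_log"] = np.log1p(out["LotArea"])
    return out

toy = pd.DataFrame({
    "1stFlrSF": [900, 1200, 1000],
    "2ndFlrSF": [0, 800, 500],
    "TotalBsmtSF": [900, np.nan, 700],
    "YrSold": [2008, 2010, 2009],
    "YearBuilt": [1960, 2005, 1999],
    "LotArea": [8000, 9600, 50000],
    "GarageType": ["Attchd", None, "Detchd"],
})
feat = add_features(toy)
print(feat[["TotalSF", "HouseAge", "LotArea_log", "GarageType"]])
```

In a real run, statistics such as the median should be learned from the training split only, which is one reason to put imputation inside a scikit-learn `Pipeline` rather than applying it to the whole DataFrame up front.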

Model Training

train.py

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

df = pd.read_csv("train.csv")
# Engineered features from the previous section; they are not in the raw CSV
df["TotalSF"] = df["1stFlrSF"] + df["2ndFlrSF"] + df["TotalBsmtSF"]
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]

num = ["GrLivArea","TotalSF","HouseAge","OverallQual"]
cat = ["Neighborhood","HouseStyle"]

# Preprocessing sits inside the pipeline so each CV fold fits it on its own training split
pre = ColumnTransformer([
    ("n", Pipeline([("imp", SimpleImputer(strategy="median")),
                    ("sc", StandardScaler())]), num),
    ("c", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                    ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat),
])

model = Pipeline([("pre", pre),
                  ("xgb", XGBRegressor(n_estimators=800, learning_rate=0.05,
                                       max_depth=4, subsample=0.8))])

# Train on log1p(SalePrice); predictions are inverted later with expm1
y = np.log1p(df["SalePrice"])
scores = cross_val_score(model, df[num+cat], y, cv=KFold(5, shuffle=True, random_state=42),
                         scoring="neg_root_mean_squared_error")
print("CV RMSE (log space):", -scores.mean())
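
train.py stops at cross-validation, while step 9 of the pipeline and app.py both expect a saved `house_model.pkl`. A minimal sketch of the fit-and-save step follows. To stay runnable without the Kaggle CSV or xgboost, it uses synthetic data and a `Ridge` regressor as a stand-in; both are assumptions for illustration, and in the real project you would call `fit` on the `model` pipeline from train.py instead.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df[num + cat] and log1p(SalePrice).
rng = np.random.default_rng(0)
X = pd.DataFrame({"GrLivArea": rng.uniform(800, 3000, 50),
                  "OverallQual": rng.integers(1, 11, 50)})
y = np.log1p(50_000 + 80 * X["GrLivArea"] + 10_000 * X["OverallQual"])

# Ridge replaces XGBRegressor here only so the sketch has no xgboost dependency.
pipe = Pipeline([("sc", StandardScaler()), ("reg", Ridge(alpha=1.0))])

# cross_val_score fits clones of the estimator, so a final fit on all data is still needed.
pipe.fit(X, y)

# Save to a temporary directory; a real project would pick a stable path.
path = os.path.join(tempfile.mkdtemp(), "house_model.pkl")
joblib.dump(pipe, path)

# Round-trip check: the reloaded pipeline should give identical predictions.
restored = joblib.load(path)
assert np.allclose(restored.predict(X), pipe.predict(X))
print("saved to", path)
```

Saving the whole pipeline, not just the regressor, keeps preprocessing and model together, so the serving code can pass raw feature columns straight to `predict`.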

Evaluation

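RMSE in log space: easier to interpret, because it behaves like a relative error and the same percentage miss counts equally on cheap and expensive houses.
Residual plot: check for bias.
Predicted vs. actual scatter: compare against the y = x line.
Feature importance: explain with SHAP.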
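
A minimal sketch of the first two checks, assuming the model predicts `log1p(price)` as in train.py. The arrays are synthetic stand-ins for held-out targets and predictions, so the numbers say nothing about the real model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic held-out prices and log-space predictions with small noise.
price_true = rng.uniform(80_000, 500_000, size=200)
log_true = np.log1p(price_true)
log_pred = log_true + rng.normal(0.0, 0.1, size=200)

# RMSE in log space: roughly a relative error.
rmse_log = np.sqrt(np.mean((log_pred - log_true) ** 2))

# Back-transform to dollars for a number that non-ML readers can use.
price_pred = np.expm1(log_pred)
rmse_usd = np.sqrt(np.mean((price_pred - price_true) ** 2))

# Residuals (actual minus predicted): a mean far from zero suggests
# systematic over- or under-prediction.
residuals = log_true - log_pred

print(f"log RMSE={rmse_log:.3f}  USD RMSE={rmse_usd:,.0f}  "
      f"mean residual={residuals.mean():+.4f}")
```

Plotting `residuals` against `log_pred` gives the residual plot; plotting `price_pred` against `price_true` with a y = x reference line gives the predicted vs. actual scatter.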

Deployment: FastAPI

app.py

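import joblib
import numpy as np
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

# Fitted pipeline saved after training, e.g. with joblib.dump(model, "house_model.pkl")
model = joblib.load("house_model.pkl")
app = FastAPI()

class House(BaseModel):
    GrLivArea: float
    TotalSF: float
    HouseAge: int
    OverallQual: int
    Neighborhood: str
    HouseStyle: str

@app.post("/predict")
def predict(h: House):
    # The ColumnTransformer selects columns by name, so it needs a DataFrame, not a list of dicts
    row = pd.DataFrame([h.model_dump()])
    pred = model.predict(row)[0]
    # The model was trained on log1p(SalePrice); expm1 converts back to dollars
    return {"price_usd": float(np.expm1(pred))}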

Summary

At a Glance

House price prediction is a first complete ML project. EDA → Features → Pipeline → CV → SHAP → FastAPI: this flow will serve you in every later project.