Hook: Your First End-to-End Project
House price prediction is the ‘Hello World’ of ML. But a real project is more than `model.fit()`: data, EDA, feature engineering, evaluation, and deployment all come together into one complete pipeline.
Problem Definition
- Goal: predict a house's price from its features.
- Type: supervised regression.
- Metric: RMSE / MAE / R².
- Dataset: Kaggle ‘House Prices - Advanced Regression Techniques’ or Ames Housing.
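To make the three metrics concrete, here is a minimal sketch that computes them with NumPy. The function name `regression_metrics` and the four prices are made up for illustration; in a real project `sklearn.metrics` provides equivalent functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))   # penalises large errors more
    mae = float(np.mean(np.abs(err)))          # average absolute error
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                 # share of variance explained
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Made-up prices in USD, for illustration only
actual = [200_000, 150_000, 320_000, 275_000]
predicted = [210_000, 140_000, 300_000, 280_000]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE and MAE are in the same unit as the target (here USD), while R² is unitless, so it helps to report at least one of each kind.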
Project Pipeline
flow
1. Data load → 2. EDA → 3. Cleaning →
4. Feature Engineering → 5. Train/Val split →
6. Model selection → 7. Hyperparameter tuning →
8. Evaluation → 9. Save model → 10. Deploy (FastAPI)
EDA: Look at the Data
- Target distribution: check whether a log transform is needed.
- Missing-value heatmap (seaborn).
- Correlation matrix: pick the top features.
- Outlier detection: boxplot / IQR.
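The IQR rule from the last bullet can be sketched as follows. The helper name `iqr_outlier_mask`, the conventional multiplier `k=1.5`, and the living-area values are assumptions for illustration, not part of the dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up living areas in square feet; the last one is suspiciously large
areas = np.array([900, 1100, 1250, 1400, 1500, 1650, 1800, 5600])
mask = iqr_outlier_mask(areas)
print(areas[mask])
```

A flagged row is a candidate for inspection, not automatic deletion: check whether it is a data-entry error or a genuinely unusual house before dropping it.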
eda.py
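import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")
# Shape, plus the 10 columns with the most missing values
print(df.shape, df.isna().sum().sort_values(ascending=False).head(10))
sns.histplot(np.log1p(df["SalePrice"]))  # target on a log scale
plt.show()
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
print(corr.head(15))  # numeric features most correlated with the target
Feature Engineering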
- Numeric: log-transform skewed features.
- Categorical: one-hot or target encoding.
- Combined: TotalSF = 1stFlrSF + 2ndFlrSF + TotalBsmtSF.
- Date: HouseAge = YrSold − YearBuilt.
- Missing: median (numeric), ‘None’ (categorical).
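The bullets above could be applied with pandas roughly like this. It is a sketch, not a fixed recipe: the helper `add_features`, the choice of `LotArea` as the skewed column, and the three-row toy frame are assumptions for illustration; the other column names follow the Kaggle dataset.

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Return a copy of df with missing values filled and new features added."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.columns.difference(num_cols)
    # Missing: median for numeric columns, the string "None" for categorical ones
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    out[cat_cols] = out[cat_cols].fillna("None")
    # Combined: total square footage across both floors and the basement
    out["TotalSF"] = out["1stFlrSF"] + out["2ndFlrSF"] + out["TotalBsmtSF"]
    # Date: age of the house at the time of sale
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    # Numeric: log1p compresses the long right tail of a skewed feature
    out["LotArea_log"] = np.log1p(out["LotArea"])
    return out

# Toy rows, for illustration only
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "TotalBsmtSF": [856, np.nan, 920],
    "YrSold": [2008, 2007, 2008],
    "YearBuilt": [2003, 1976, 2001],
    "LotArea": [8450, 9600, 11250],
    "GarageType": ["Attchd", None, "Detchd"],
})
feats = add_features(toy)
print(feats[["TotalSF", "HouseAge", "LotArea_log", "GarageType"]])
```

Row-wise features such as TotalSF and HouseAge are safe to compute up front. Statistics learned from the data, such as the median used for imputation, are better kept inside the training pipeline so that cross-validation folds do not leak into each other.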
Model Training
train.py
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

df = pd.read_csv("train.csv")
# Engineered features from the Feature Engineering section
df["TotalSF"] = df["1stFlrSF"] + df["2ndFlrSF"] + df["TotalBsmtSF"]
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]

num = ["GrLivArea", "TotalSF", "HouseAge", "OverallQual"]
cat = ["Neighborhood", "HouseStyle"]
pre = ColumnTransformer([
    ("n", Pipeline([("imp", SimpleImputer(strategy="median")),
                    ("sc", StandardScaler())]), num),
    ("c", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                    ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat),
])
model = Pipeline([("pre", pre),
                  ("xgb", XGBRegressor(n_estimators=800, learning_rate=0.05,
                                       max_depth=4, subsample=0.8))])

y = np.log1p(df["SalePrice"])  # train on the log of the price
scores = cross_val_score(model, df[num + cat], y,
                         cv=KFold(5, shuffle=True, random_state=42),
                         scoring="neg_root_mean_squared_error")
print("CV RMSE:", -scores.mean())

# Step 9: fit on all rows and save the pipeline that app.py loads
model.fit(df[num + cat], y)
joblib.dump(model, "house_model.pkl")
Evaluation
- RMSE in log space: reads roughly as a relative error, so cheap and expensive houses are weighted alike.
- Residual plot: check for bias.
- Predicted vs actual scatter: compare against the y = x line.
- Feature importance: explain with SHAP.
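A small numeric sketch shows why log-space RMSE behaves like a relative error and how a residual check exposes bias. The helper `evaluate_log_predictions` and the three prices are hypothetical; the example assumes a model that overprices every house by 10%.

```python
import numpy as np

def evaluate_log_predictions(y_true_log, y_pred_log):
    """Summarise predictions made on a log1p-transformed target."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    residuals = y_true_log - y_pred_log
    # Undo the log1p transform to express the error in USD as well
    usd_err = np.expm1(y_true_log) - np.expm1(y_pred_log)
    return {
        "rmse_log": float(np.sqrt(np.mean(residuals ** 2))),
        "mean_residual": float(residuals.mean()),  # far from 0 means bias
        "rmse_usd": float(np.sqrt(np.mean(usd_err ** 2))),
    }

# Made-up prices; every prediction is 10% too high
true_usd = np.array([100_000, 200_000, 400_000], dtype=float)
pred_usd = true_usd * 1.10
report = evaluate_log_predictions(np.log1p(true_usd), np.log1p(pred_usd))
print(report)
```

In log space each house contributes almost the same error (about log 1.1), whereas the USD error is dominated by the most expensive house. The negative mean residual is the numeric counterpart of a residual plot sitting below zero, which signals systematic overprediction.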
Deployment — FastAPI
app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np
import pandas as pd

model = joblib.load("house_model.pkl")
app = FastAPI()

class House(BaseModel):
    GrLivArea: float
    TotalSF: float
    HouseAge: int
    OverallQual: int
    Neighborhood: str
    HouseStyle: str

@app.post("/predict")
def predict(h: House):
    # The pipeline selects columns by name, so pass a one-row DataFrame
    X = pd.DataFrame([h.model_dump()])
    pred = model.predict(X)[0]
    return {"price_usd": float(np.expm1(pred))}  # undo the log1p target
Summary
At a glance
House price prediction is your first complete ML project. EDA → Features → Pipeline → CV → SHAP → FastAPI: this flow will come in handy in every project.