Hook: Learning by Trial and Error
Supervised learning has labels; RL has none, only a reward signal. The agent takes an action, the environment returns a reward, and the agent learns how to maximize its long-term reward. AlphaGo, ChatGPT (through RLHF), and robotics are all applications of RL.
MDP: Markov Decision Process
- State (s): the current situation the agent observes.
- Action (a): what the agent does.
- Reward (r): feedback from the environment.
- Policy π(a|s): which action to take in a given state.
- Value V(s) / Q(s,a): expected long-term reward.
Gₜ = rₜ₊₁ + γ·rₜ₊₂ + γ²·rₜ₊₃ + ... (γ = discount factor)
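To make the return formula concrete, here is a minimal sketch that computes the discounted return for a short list of rewards. The function name `discounted_return` and the sample reward values are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return G_0 for rewards [r_1, r_2, ...] and discount factor gamma."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # made-up rewards for illustration
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A smaller γ makes the agent short-sighted, while γ close to 1 weights distant rewards almost as much as immediate ones.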
Exploration vs Exploitation
Stick with the best option known so far (exploit), or try something new (explore)? ε-greedy, softmax, and UCB are techniques for balancing the two.
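The three strategies named above can be sketched as action-selection functions over a vector of estimated action values. This is a rough illustration using only NumPy; the function names, the `temperature` and `c` parameters, and the sample values are assumptions chosen for the example, and the UCB rule shown is a UCB1-style variant.

```python
import numpy as np

rng = np.random.default_rng(0)


def epsilon_greedy(q, eps):
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))


def softmax_action(q, temperature=1.0):
    """Sample an action with probability proportional to exp(q / temperature)."""
    z = (q - np.max(q)) / temperature  # subtract the max for numerical stability
    p = np.exp(z) / np.sum(np.exp(z))
    return int(rng.choice(len(q), p=p))


def ucb_action(q, counts, t, c=2.0):
    """UCB1-style rule: value estimate plus a bonus for rarely tried actions."""
    if np.any(counts == 0):
        return int(np.argmin(counts))  # try every action once first
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))


q = np.array([0.2, 0.5, 0.1])      # made-up value estimates
counts = np.array([10, 10, 1])     # how often each action was tried
print(epsilon_greedy(q, eps=0.0))  # eps = 0 is purely greedy, so action 1
print(ucb_action(q, counts, t=21)) # action 2: rarely tried, so its bonus dominates
```

ε-greedy explores uniformly at random, softmax explores in proportion to estimated value, and UCB directs exploration toward actions whose estimates are still uncertain.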
Major Algorithms
- Q-Learning: tabular, off-policy.
- DQN: Deep Q-Network (Atari).
- Policy Gradient: REINFORCE.
- Actor-Critic: A2C, A3C.
- PPO: stable; OpenAI's default algorithm.
- SAC: for continuous action spaces.
Q(s,a) ← Q(s,a) + α·(r + γ·maxₐ′ Q(s′,a′) − Q(s,a)) (α = learning rate; the max is over next actions a′)
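A single application of the Q-learning update above can be worked through numerically. The sketch below assumes a toy table of 3 states and 2 actions; the helper `q_update` and its `terminal` flag are illustrative additions, not part of the original formula.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Assumption: no bootstrapping from a terminal next state (future return is 0).
    target = r if terminal else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error


Q = np.zeros((3, 2))   # 3 states, 2 actions (toy sizes)
Q[1] = [0.0, 1.0]      # pretend state 1 already has a known good action
q_update(Q, s=0, a=1, r=0.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 1])         # 0 + 0.5 * (0 + 0.9 * 1.0 - 0) = 0.45
```

Even with zero immediate reward, value flows backwards from state 1 to state 0; this is how a reward at the goal gradually propagates to earlier states.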
Code: Gymnasium + Stable Baselines3
ppo_cartpole.py
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)  # train the policy

# Roll out the trained policy for 500 steps
obs, _ = env.reset()
for _ in range(500):
    action, _ = model.predict(obs)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()  # start a new episode
Tabular Q-Learning (FrozenLake)
q_learning.py
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.8, 0.95, 0.1

for ep in range(2000):
    s, _ = env.reset()
    done = False
    while not done:
        if np.random.rand() < eps:
            a = env.action_space.sample()  # explore
        else:
            # exploit; break ties at random so an all-zero row does not always pick action 0
            a = int(np.random.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, term, trunc, _ = env.step(a)
        # Q-learning update toward r + gamma * max over next actions of Q[s2]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        done = term or trunc
Applications
- Game AI: AlphaGo, AlphaStar, OpenAI Five.
- Robotics: manipulation, locomotion.
- RLHF: LLM alignment (ChatGPT).
- Recommender systems: long-term user engagement.
- Trading, energy management, traffic light control.
Common Mistakes
- Reward hacking: the agent increases its reward in unintended ways.
- Sample inefficiency: training takes a very large number of episodes.
- Sparse rewards: learning becomes nearly impossible; reward shaping is needed.
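One common form of reward shaping is potential-based shaping, which adds γ·Φ(s′) − Φ(s) to the environment reward. The sketch below is an illustration under stated assumptions: the potential function `grid_potential` (negative Manhattan distance to a bottom-right goal on a 4x4 grid) and the helper `shaped_reward` are hypothetical and chosen only for this example.

```python
def shaped_reward(r, s, s_next, potential, gamma, terminal=False):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    # Assumption: the potential of a terminal state is treated as 0.
    phi_next = 0.0 if terminal else potential(s_next)
    return r + gamma * phi_next - potential(s)


def grid_potential(state, size=4):
    """Hypothetical potential: negative Manhattan distance to the bottom-right cell."""
    row, col = divmod(state, size)
    return -float((size - 1 - row) + (size - 1 - col))


# Moving from cell 0 to cell 1 earns no environment reward, but shaping
# adds a positive signal because the agent moved closer to the goal.
print(shaped_reward(0.0, 0, 1, grid_potential, gamma=1.0))  # -5 - (-6) = 1.0
```

A poorly chosen shaping term can itself invite reward hacking, so the shaped signal should only hint at progress and should not replace the true objective.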
Summary
At a glance
RL = State, Action, Reward, Policy. Start with Q-Learning, use PPO for production, and RLHF for LLM alignment.