Reinforcement Learning: Machine Learning

Hook: Learning by Trial and Error

Supervised learning has labels; RL does not. It has only a reward signal. The agent takes an action, the environment returns a reward, and the agent learns how to maximize long-term reward. AlphaGo, ChatGPT (through RLHF), and robotics are all applications of RL.

MDP — Markov Decision Process

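State (s): the current situation the agent is in.
Action (a): what the agent does.
Reward (r): feedback from the environment.
Policy π(a|s): which action to take in a given state (the probability of action a in state s).
Value V(s) / Q(s,a): the expected long-term reward from state s, or from taking action a in state s.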

Gₜ = rₜ₊₁ + γ·rₜ₊₂ + γ²·rₜ₊₃ + ... (γ = discount factor)
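The return formula can be checked with a few lines of Python. This is a minimal sketch: the function name `discounted_return` and the three-step reward list are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return G_t given the rewards [r_{t+1}, r_{t+2}, ...] and discount gamma."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical three-step episode
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A smaller γ makes the agent short-sighted: with γ = 0 only the immediate reward counts, while γ = 1 weights every future reward equally.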

Exploration vs Exploitation

Should the agent stick with the option it already knows is good (exploit), or try something new (explore)? ε-greedy, softmax, and UCB are strategies for balancing the two.
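The three strategies can be sketched as action-selection functions over a list of estimated action values. The function names, the `temperature` parameter, and the exploration constant `c` are assumptions made for illustration; the UCB variant here is one common form (estimate plus a bonus that shrinks as an action is tried more often), not the only one.

```python
import math
import random


def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a random action, otherwise the best-known one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature=1.0):
    """Action probabilities proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action with the highest estimate plus exploration bonus."""
    for a, n in enumerate(counts):
        if n == 0:  # try every action once before comparing bonuses
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

With `eps=0` the first function is purely greedy; a high `temperature` pushes softmax toward uniform random choice; and the UCB bonus favors actions with a low visit count even when their current estimate is lower.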

Major Algorithms

Q-Learning: tabular, off-policy.
DQN: Deep Q-Network (Atari).
Policy Gradient: REINFORCE.
Actor-Critic: A2C, A3C.
PPO: stable; OpenAI's default RL algorithm.
SAC: for continuous action spaces.

Q(s,a) ← Q(s,a) + α·(r + γ·max_a' Q(s',a') − Q(s,a))   (α = learning rate, s' = next state)
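A single application of the Q-learning update, worked through with made-up numbers. The helper name `q_update` and all values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    """One Q-learning step: move Q(s,a) toward r + gamma * max over Q(s', .)."""
    td_target = reward + gamma * max(next_q_values)
    td_error = td_target - q_sa
    return q_sa + alpha * td_error


# Hypothetical numbers: Q(s,a) = 0.5, r = 1, Q(s', .) = [0.0, 2.0]
new_q = q_update(0.5, 1.0, [0.0, 2.0], alpha=0.5, gamma=0.5)
print(new_q)  # target = 1 + 0.5 * 2 = 2.0, so 0.5 + 0.5 * (2.0 - 0.5) = 1.25
```

The update uses the best action in the next state regardless of which action the agent actually takes next, which is what makes Q-learning off-policy.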

Code — Gymnasium + Stable Baselines3

ppo_cartpole.py

import gymnasium as gym
from stable_baselines3 import PPO

# Train PPO with a small MLP policy on CartPole.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)

# Roll out the trained policy for 500 steps, resetting whenever an episode ends.
obs, _ = env.reset()
for _ in range(500):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()

Tabular Q-Learning (FrozenLake)

q_learning.py

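import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
# One row per state, one column per action.
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.8, 0.95, 0.1  # learning rate, discount factor, exploration rate

for ep in range(2000):
    s, _ = env.reset()
    done = False
    while not done:
        # ε-greedy: random action with probability eps, otherwise the greedy one.
        if np.random.rand() < eps:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s2, r, term, trunc, _ = env.step(a)
        # Q-learning update; terminal states are never updated, so their row stays 0.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        done = term or trunc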
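Once training finishes, the learned behavior is the greedy policy: in each state, take the action with the highest Q value. A minimal pure-Python sketch follows, using a hypothetical 3-state, 2-action table rather than values actually learned on FrozenLake. For the NumPy `Q` array above, `np.argmax(Q, axis=1)` gives the same result.

```python
def greedy_policy(q_table):
    """For each state (row), return the index of the highest-valued action."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_table]


# Hypothetical table for illustration only.
q_table = [
    [0.1, 0.7],
    [0.4, 0.2],
    [0.0, 0.0],  # ties resolve to the first action, as np.argmax does
]
print(greedy_policy(q_table))  # [1, 0, 0]
```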

Applications

Game AI: AlphaGo, AlphaStar, OpenAI Five.
Robotics: manipulation, locomotion.
RLHF: LLM alignment (ChatGPT).
Recommender systems: long-term user engagement.
Trading, energy management, traffic light control.

Common Mistakes

Reward hacking: the agent increases its reward in ways the designer did not intend.
Sample inefficiency: training takes a very large number of episodes.
Sparse reward: learning is nearly impossible when reward is rare; reward shaping is needed.
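One well-known form of reward shaping is potential-based shaping, which adds γ·Φ(s') − Φ(s) to the environment reward and is known to leave the optimal policy unchanged. The sketch below assumes a 4x4 grid with the goal at state 15 (the layout of the default FrozenLake map) and uses negative Manhattan distance as the potential Φ. Both the potential and the function names are illustrative assumptions, not something prescribed by Gymnasium.

```python
def potential(state, goal=15, width=4):
    """Hypothetical potential: negative Manhattan distance to the goal cell."""
    row, col = divmod(state, width)
    goal_row, goal_col = divmod(goal, width)
    return -(abs(goal_row - row) + abs(goal_col - col))


def shaped_reward(reward, state, next_state, gamma=0.95):
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


print(shaped_reward(0.0, state=0, next_state=1))  # step toward the goal: positive
print(shaped_reward(0.0, state=1, next_state=0))  # step away from it: negative
```

The agent now gets a small signal on every step instead of only at the goal. A poorly chosen shaping term is itself a common source of reward hacking, so the potential should encode progress rather than a specific route.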

Summary

At a glance

RL = State, Action, Reward, Policy. Start with Q-Learning, use PPO for production, and RLHF for LLM alignment.