A full journey through modern reinforcement learning on Snake, from tabular Q-learning through ConvNets, with interactive notebooks, real training logs, and honest empirical results.
Final PPO agent after the 5×5 → 10×10 curriculum
Interactive Notebooks
Curiosity Killed the Snake
Does the Intrinsic Curiosity Module (Pathak et al., 2017) actually help? A 2×3×2×3 ablation across two algorithms (DQN, PPO), three reward modes, two boards, and three seeds. Includes the death-oversampling trap and the terminal-mask fix.
ConvNet vs Feature DQN
Can raw grid pixels beat 24 hand-crafted features? A 30-trial W&B hyperparameter sweep, architecture walkthrough, and RND exploration ablation. ConvNet v3 reaches a mean score of 12.32 on 10×10 vs Feature DQN's 5.50.
Results
10×10 board · 10k games · 3 seeds
16×16 board · 16k games
Experiment History
1 · Zero to Hero: the Curriculum Journey
| Phase | Algorithm | Board | Max Score | Notes |
|---|---|---|---|---|
| 0 | Tabular Q-Learning | 5×5 | 24 | Perfect on small board |
| 1 | Double Q-Learning | 5×5 | 24 | Stable convergence |
| 2 | Behaviour Cloning + PPO | 8×8 | 46 | Human demos bootstrap policy |
| 3 | PPO + Curriculum (5×5 → 10×10) | 10×10 | 64 | Progressive board growth |
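The table does not say how the curriculum decides when to grow the board. As a rough sketch only, a promotion rule might look like the following, assuming the board grows one cell per side at a time and that promotion triggers when the recent mean score fills a fixed fraction of the board. The function name, the `fill_fraction` threshold, and the one-step growth are all hypothetical and are not taken from the repo.

```python
def next_board_size(size, recent_scores, max_size=10, fill_fraction=0.3):
    """Return the board side length to train on next.

    Hypothetical rule: grow by one when the mean of recent scores
    reaches `fill_fraction` of the board's cell count.
    """
    if size >= max_size or not recent_scores:
        return size
    mean_score = sum(recent_scores) / len(recent_scores)
    threshold = fill_fraction * size * size
    return size + 1 if mean_score >= threshold else size


# Walk a made-up score history through the rule, starting at 5x5.
size = 5
history = [[2, 3, 4], [8, 9, 10], [5, 6, 7]]
sizes = []
for scores in history:
    size = next_board_size(size, scores)
    sizes.append(size)
```

With these illustrative scores the agent stays at 5×5 after the first window, is promoted to 6×6 after the second, and stays there after the third.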
2 · DQN Ablations (8×8, 4k games, n=3)
| Algorithm | Mean Score | Notes |
|---|---|---|
| DQN + Dueling + 3-step | 4.61 | Pilot A baseline |
| DQN + PER | 4.48 | Pilot B: prioritised replay |
| DQN + PER + ICM (unmasked) | 4.46 | Pilot C: ICM poisons terminal transitions |
| DQN + PER + ICM (terminal mask) | 4.60 | Pilot D: fix recovers baseline |
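To make the Pilot C vs Pilot D difference concrete, here is a minimal sketch of what a terminal mask could look like. It assumes the intrinsic bonus is a scaled forward-model prediction error and that the mask simply zeroes the bonus on transitions where the episode ended. The function name and the `eta` scale are hypothetical, and the repo's actual fix may differ.

```python
def masked_intrinsic_rewards(pred_errors, dones, eta=0.1):
    """Turn forward-model errors into curiosity bonuses.

    Terminal transitions (done=True) get a bonus of 0.0 so that a large
    prediction error at death cannot be rewarded. `eta` is an assumed
    scaling factor, not a value from the repo.
    """
    if len(pred_errors) != len(dones):
        raise ValueError("pred_errors and dones must have the same length")
    return [0.0 if done else eta * err for err, done in zip(pred_errors, dones)]


# Illustrative batch: the last transition is a death with a large error.
errors = [0.2, 0.1, 3.0]
dones = [False, False, True]
bonuses = masked_intrinsic_rewards(errors, dones)
```

In this made-up batch the death transition has by far the largest prediction error; without the mask it would also receive the largest bonus.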
3 · ICM Investigation (10×10, 5k games, n=3)
| Algorithm | Reward mode | Mean Score | Notes |
|---|---|---|---|
| PPO | dense | 0.12 | PPO struggles with shaped reward |
| PPO + ICM | dense | 0.14 | ICM doesn't rescue PPO + dense |
| PPO | pure sparse | 5.36 | PPO shines without dense shaping |
| PPO + ICM | pure sparse | 6.63 | +24%: ICM's real contribution |
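The +24% figure can be checked directly from the two pure-sparse rows above. The helper name below is just for illustration.

```python
def relative_gain(baseline, treated):
    """Fractional improvement of `treated` over `baseline`."""
    return (treated - baseline) / baseline


# Mean scores from the pure-sparse rows: PPO vs PPO + ICM.
gain = relative_gain(5.36, 6.63)
print(f"{gain:+.1%}")  # prints +23.7%, which rounds to the +24% in the table
```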
4 · Curiosity Summary
| Question | Answer |
|---|---|
| Does ICM help DQN with dense reward? | No: \|Δ\| < 0.05 |
| Does ICM rescue DQN under sparse reward? | No: \|Δ\| < 0.15 |
| Does dense shaping matter for DQN? | Yes: ~15% score drop without it |
| Does ICM help PPO under sparse reward? | Yes: +24% on pure_sparse 10×10 |
| Does ICM increase state coverage? | No: both saturate at ~99% |
5 · ConvNet Sweep Findings
- n_steps=1 is universally better with PER: it appeared in every top sweep config, and multi-step returns hurt with prioritised replay
- ch1=64 / ch2=128: the larger conv configuration consistently wins over ch1=32 / ch2=64
- epsilon_end=0.01, gamma=0.995: exploit harder, value distant food more
- RND null result: saturation on sparse boards; curiosity helps weak baselines, not well-tuned agents
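For readers unfamiliar with what `n_steps` controls, the sketch below computes a bootstrapped n-step target using the `gamma` reported above. The dict layout, the function name, and the sample rewards and values are hypothetical. Terminal handling is left out for brevity, so this is an illustration of the quantity being tuned, not the repo's replay code.

```python
# Values reported in the sweep findings; the dict layout itself is hypothetical.
SWEEP_BEST = {"n_steps": 1, "ch1": 64, "ch2": 128, "epsilon_end": 0.01, "gamma": 0.995}


def n_step_target(rewards, next_values, gamma, n):
    """Discounted sum of the first n rewards plus a discounted bootstrap.

    rewards[k] is the reward k steps after the sampled transition;
    next_values[k] is the value estimate of the state reached after
    step k. Episode termination is ignored in this sketch.
    """
    if not rewards or len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must be non-empty and equal length")
    n = min(n, len(rewards))
    discounted = sum((gamma ** k) * rewards[k] for k in range(n))
    return discounted + (gamma ** n) * next_values[n - 1]


# Made-up trajectory: food is eaten on the third step.
rewards = [0.0, 0.0, 1.0]
next_values = [0.5, 0.8, 0.2]
gamma = SWEEP_BEST["gamma"]
one_step = n_step_target(rewards, next_values, gamma, SWEEP_BEST["n_steps"])
three_step = n_step_target(rewards, next_values, gamma, 3)
```

With `n_steps=1` the target relies entirely on the value estimate of the next state, while `n_steps=3` folds in two more real rewards before bootstrapping.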
Running Locally
git clone https://github.com/Saheb/rl-snake.git
cd rl-snake
uv sync
uv run marimo edit notebooks/curiosity.py
uv run marimo edit notebooks/convnet.py