๐Ÿ RL Snake


A full journey through modern reinforcement learning on Snake, from tabular Q-learning through ConvNets, with interactive notebooks, real training logs, and honest empirical results.

PPO agent mastering 10×10 Snake after curriculum training

Final PPO agent after the 5×5 → 10×10 curriculum

Interactive Notebooks

Curiosity Killed the Snake

Does the Intrinsic Curiosity Module (Pathak et al., 2017) actually help? A 2×3×2×3 ablation across two algorithms (DQN and PPO), three reward modes, two boards, and three seeds. Includes the death-oversampling trap and the terminal-mask fix.
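As a rough illustration of what a terminal mask can look like, the sketch below computes an ICM-style curiosity bonus (scaled squared error of a forward model's next-state prediction) and zeroes it on transitions that end the episode. This is one plausible reading of the fix, not the notebook's actual code: the function name `masked_intrinsic_rewards`, the `scale` parameter, and the plain-list feature vectors are all assumptions made for the example.

```python
def masked_intrinsic_rewards(pred_next, true_next, dones, scale=0.5):
    """Curiosity bonus per transition, set to zero where the episode ended.

    pred_next / true_next: lists of feature vectors (hypothetical forward-model
    output and the encoded next state). dones: list of booleans.
    """
    rewards = []
    for pred, true, done in zip(pred_next, true_next, dones):
        # Forward-model prediction error in feature space.
        error = sum((p - t) ** 2 for p, t in zip(pred, true))
        bonus = scale * error
        # Terminal mask: a death transition earns no curiosity bonus, so a
        # large prediction error at episode end cannot reward dying.
        rewards.append(0.0 if done else bonus)
    return rewards


bonuses = masked_intrinsic_rewards(
    pred_next=[[1.0, 0.0], [0.0, 0.0]],
    true_next=[[0.0, 0.0], [3.0, 4.0]],
    dones=[False, True],
)
```

In this toy call the second transition has the larger prediction error, but because it is terminal its bonus is masked out.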


ConvNet vs Feature DQN

Can raw grid pixels beat 24 hand-crafted features? A 30-trial W&B hyperparameter sweep, architecture walkthrough, and RND exploration ablation. ConvNet v3 reaches mean 12.32 on 10×10 vs Feature DQN's 5.50.
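To make the "raw grid" input concrete, here is a minimal sketch of one common way to turn a Snake board into a multi-channel array for a ConvNet. The three-plane layout (head, body, food) and the helper name `encode_grid` are assumptions for illustration; the notebook's actual encoding may differ.

```python
def encode_grid(size, snake, food):
    """Return a [3][size][size] nested list of 0.0/1.0 planes.

    snake: list of (row, col) cells, head first. food: (row, col).
    Plane 0 marks the head, plane 1 the body, plane 2 the food.
    """
    planes = [[[0.0] * size for _ in range(size)] for _ in range(3)]
    head_row, head_col = snake[0]
    planes[0][head_row][head_col] = 1.0
    for row, col in snake[1:]:
        planes[1][row][col] = 1.0
    planes[2][food[0]][food[1]] = 1.0
    return planes


obs = encode_grid(5, snake=[(2, 2), (2, 1)], food=(0, 4))
```

A feature-based agent would instead summarise the same board as a fixed-length vector, which is the comparison the notebook draws.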


Results

10×10 board · 10k games · 3 seeds

| Agent | Mean score |
|---|---|
| ConvNet v3 | 12.32 |
| ConvNet+RND | 11.96 |
| Feature DQN | 5.50 |
| ConvNet v1 | 1.49 |

16×16 board · 16k games

| Agent | Mean score |
|---|---|
| ConvNet v3 | 10.83 |
| Feature DQN | 5.92 |
| ConvNet v1 | 0.20 |

Experiment History

1 · Zero to Hero: the Curriculum Journey


| Phase | Algorithm | Board | Max Score | Notes |
|---|---|---|---|---|
| 0 | Tabular Q-Learning | 5×5 | 24 ✓ | Perfect on small board |
| 1 | Double Q-Learning | 5×5 | 24 | Stable convergence |
| 2 | Behaviour Cloning + PPO | 8×8 | 46 | Human demos bootstrap policy |
| 3 | PPO + Curriculum (5→10×10) | 10×10 | 64 | Progressive board growth |
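For readers new to the first phase, the standard tabular Q-learning update can be sketched as below. The state keys, the four-action assumption, and the `alpha`/`gamma` values are placeholders chosen for the example, not the project's settings.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, done,
             n_actions=4, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q-value."""
    # Bootstrap from the best next action unless the episode ended.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
first = q_update(q, "s0", 1, reward=1.0, next_state="s1", done=False)
```

Double Q-learning, used in phase 1, keeps two such tables and uses one to pick the next action and the other to evaluate it, which reduces the overestimation caused by the `max`.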

2 · DQN Ablations (8×8, 4k games, n=3)

| Algorithm | Mean Score | Notes |
|---|---|---|
| DQN + Dueling + 3-step | 4.61 | Pilot A: baseline |
| DQN + PER | 4.48 | Pilot B: prioritised replay |
| DQN + PER + ICM (unmasked) | 4.46 | Pilot C: ICM poisons terminal transitions |
| DQN + PER + ICM (terminal mask) | 4.60 | Pilot D: fix recovers baseline |
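The "3-step" in the baseline refers to multi-step returns. A minimal sketch of an n-step bootstrapped target follows; the function name, the `gamma` default, and the assumption that `bootstrap_value` is the max Q-value of the state reached after the last reward are illustrative choices, not taken from the repository.

```python
def n_step_target(rewards, bootstrap_value, done, gamma=0.99):
    """Discounted sum of the given rewards plus a discounted bootstrap value.

    rewards: the n rewards collected along the trajectory segment, in order.
    done: True if the episode ended within the segment (no bootstrap).
    """
    target = 0.0 if done else bootstrap_value
    # Fold from the last reward backwards: G = r + gamma * G_next.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


three_step = n_step_target([1.0, 0.0, 2.0], bootstrap_value=10.0, done=False, gamma=0.5)
```

With `gamma=0.5` this evaluates to 1 + 0.5·0 + 0.25·2 + 0.125·10.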

3 · ICM Investigation (10×10, 5k games, n=3)

| Algorithm | Reward mode | Mean Score | Notes |
|---|---|---|---|
| PPO | dense | 0.12 | PPO struggles with shaped reward |
| PPO + ICM | dense | 0.14 | ICM doesn't rescue PPO + dense |
| PPO | pure sparse | 5.36 | PPO shines without dense shaping |
| PPO + ICM | pure sparse | 6.63 | +24%: ICM's real contribution |
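The +24% figure is the relative change between the two pure-sparse mean scores, which can be checked directly (the helper name `relative_change` is just for this example):

```python
def relative_change(baseline, treatment):
    """Fractional change of treatment relative to baseline."""
    return (treatment - baseline) / baseline


ppo_sparse = 5.36      # PPO, pure sparse
ppo_icm_sparse = 6.63  # PPO + ICM, pure sparse
gain = relative_change(ppo_sparse, ppo_icm_sparse)
print(f"{gain:+.1%}")  # prints +23.7%
```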

4 · Curiosity Summary

| Question | Answer |
|---|---|
| Does ICM help DQN with dense reward? | No, absolute Δ < 0.05 |
| Does ICM rescue DQN under sparse reward? | No, absolute Δ < 0.15 |
| Does dense shaping matter for DQN? | Yes, ~15% score drop without it |
| Does ICM help PPO under sparse reward? | Yes, +24% on pure_sparse 10×10 |
| Does ICM increase state coverage? | No, both saturate at ~99% |

5 · ConvNet Sweep Findings

Running Locally

git clone https://github.com/Saheb/rl-snake.git
cd rl-snake
uv sync
uv run marimo edit notebooks/curiosity.py
uv run marimo edit notebooks/convnet.py