LLM Pokémon Red
A vision-language model plays a 1996 game with no hints and its own memory
The Problem
Standard LLM benchmarks — MMLU, HumanEval, GSM8K — test isolated capabilities: knowledge retrieval, code generation, math reasoning. They don’t test whether a model can reason across time, track state that changes frame by frame, and plan ahead in an environment that offers no explicit affordances.
Pokémon Red is a 1996 Game Boy RPG. No tooltips. No accessibility helpers. Just pixels and game logic that the model has to infer from visual history.
The question: can a vision model actually play it?
What We Built
A fully autonomous agent loop built on PyBoy (Game Boy emulator) + Claude’s vision API:
- Frame capture — the emulator renders each frame; a PIL pipeline upscales 160×144 → 480×432 and boosts contrast for the vision model
- Notepad memory — the model maintains a markdown file (`memory.md`) where it records map locations, items collected, blocked paths, and failed strategies
- Tool calling loop — the model receives the frame and its own notepad, then calls tools like `press_button`, `read_dialog`, and `update_memory` (a sketch of one iteration follows this list)
- Curriculum planner — a lightweight state machine tracks high-level goals (get starter → beat Brock → reach Cerulean City) and injects them into the prompt
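To make the loop concrete, here is a minimal sketch of a single iteration, assuming PyBoy 2.x (`pyboy.screen.image`, `pyboy.button`) and the Anthropic Messages API. The tool schemas, prompt text, and model name are illustrative placeholders; the real loop also exposes `read_dialog`, feeds tool results back to the model, and gets its goal string from the curriculum planner.

```python
import base64
import io

import anthropic
from pyboy import PyBoy

client = anthropic.Anthropic()

# Two of the agent's tools; read_dialog is omitted for brevity.
TOOLS = [
    {
        "name": "press_button",
        "description": "Press one Game Boy button: a, b, start, select, up, down, left, right.",
        "input_schema": {
            "type": "object",
            "properties": {"button": {"type": "string"}},
            "required": ["button"],
        },
    },
    {
        "name": "update_memory",
        "description": "Overwrite the notepad (memory.md) with new contents.",
        "input_schema": {
            "type": "object",
            "properties": {"content": {"type": "string"}},
            "required": ["content"],
        },
    },
]

def frame_to_base64(frame) -> str:
    buf = io.BytesIO()
    frame.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def step(pyboy: PyBoy, notepad: str, goal: str) -> str:
    """One iteration: capture a frame, ask the model, dispatch its tool calls."""
    frame = enhance_frame(pyboy.screen.image)  # enhance_frame: the PIL pipeline below
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        tools=TOOLS,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": frame_to_base64(frame),
                }},
                {"type": "text",
                 "text": f"Current goal: {goal}\n\nYour notepad:\n{notepad}"},
            ],
        }],
    )
    for block in response.content:
        if block.type != "tool_use":
            continue
        if block.name == "press_button":
            pyboy.button(block.input["button"])  # pressed now, released next tick
            pyboy.tick(30)                       # let the game respond (~0.5 s)
        elif block.name == "update_memory":
            notepad = block.input["content"]
    return notepad
```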
How It Works
The image enhancement pipeline is the first thing that made a real difference:
```python
from PIL import Image, ImageEnhance

def enhance_frame(raw_frame: Image.Image) -> Image.Image:
    # 3x upscale with nearest-neighbor to preserve hard pixel-art edges
    frame = raw_frame.resize(
        (raw_frame.width * 3, raw_frame.height * 3),
        Image.NEAREST,
    )
    # Boost contrast for the vision model
    enhancer = ImageEnhance.Contrast(frame)
    return enhancer.enhance(1.8)
```

The model’s self-maintained notepad is the second critical piece. Rather than relying on conversation history (which would blow the context window after 20 minutes), the model writes structured updates:
```markdown
## Current location
Route 1, heading north. Last known position: just south of Viridian City entrance.

## Inventory
- Potion x2
- Pokéball x5
- Pokedex

## Blocked paths
- Dark cave north of town: need HM Flash (not obtained)

## Failed strategies
- Trying to use Bulbasaur against Brock's Onix: rock is resistant to grass. Need Pidgey.
```
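Persistence is deliberately boring. A minimal file-backed helper along these lines (names hypothetical) is all the `update_memory` tool needs behind it:

```python
from pathlib import Path

class Notepad:
    """File-backed memory: survives emulator restarts, keeps prompts small."""

    def __init__(self, path: str = "memory.md"):
        self.path = Path(path)

    def read(self) -> str:
        return self.path.read_text() if self.path.exists() else ""

    def write(self, content: str) -> None:
        # In this sketch the model sends full replacement text, so a write is
        # a plain overwrite; appending instead would grow the file without bound.
        self.path.write_text(content)
```

Whether an update replaces the whole file or merges individual sections is a real design choice: replacement keeps the prompt bounded, at the cost of the model occasionally dropping details it should have kept.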
The Outcome
The benchmark revealed specific, reproducible failure modes that standard datasets don’t surface: the model consistently misjudges its relative position when exiting buildings (screen transitions give no clear cardinal-direction cues), struggles with inventory management across long sessions, and occasionally loops in areas with visually similar tiles.
These failure modes were documented and fed back into evaluation work. The project showed that long-horizon spatial reasoning — not just perception — is the hard problem for current vision models.