AutoResearch: The Complete Guide to Karpathy's Autonomous AI Research Agent
AutoResearch lets AI agents run LLM research experiments autonomously overnight. The agent edits train.py, trains for 5 minutes, checks if results improved, keeps or discards, repeats. You wake up to a log of experiments and a better model. By Andrej Karpathy. 13,100+ stars in 3 days.
What Is AutoResearch?
Give an AI agent a small but real LLM training setup and let it experiment autonomously. It modifies code, trains, evaluates, iterates — all night. ~12 experiments/hour, ~100 experiments while you sleep.
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun... That era is long gone. This repo is the story of how it all began." — @karpathy, March 2026
- Language: Python
- License: MIT
- Stars: 13,100+ ⭐ (in 3 days!)
- Forks: 1,704
- Contributors: 5
- Author: Andrej Karpathy
- Companion: nanochat
Only 3 Files
| File | Who Edits | Purpose |
|---|---|---|
| prepare.py | Nobody | Fixed constants, data prep, dataloader, evaluation |
| train.py | The AI Agent | GPT model, optimizer (Muon + AdamW), training loop |
| program.md | The Human | Agent instructions — the "research org code" |
The human writes program.md (the research program). The AI agent modifies train.py (the code). Everything is fair game: architecture, hyperparameters, optimizer, batch size.
How It Works
1. Agent reads program.md
2. Agent modifies train.py
3. Training runs for exactly 5 minutes
4. Metric: val_bpb (validation bits per byte)
5. If improved → keep changes
6. If not → discard changes
7. Repeat (12 experiments/hour, ~100 overnight)
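The keep-or-discard loop above is essentially a hill climb on val_bpb. A minimal sketch follows; `run_experiment` and `propose_edit` are stand-ins for the real 5-minute training run and the agent's code edit, not the repo's actual API.

```python
import random

def run_experiment(train_py: str) -> float:
    """Stand-in for a real 5-minute training run: returns val_bpb
    for the given train.py contents. Here we just simulate noise."""
    random.seed(hash(train_py) % 2**32)
    return 1.0 + random.uniform(-0.1, 0.1)

def propose_edit(train_py: str, step: int) -> str:
    """Stand-in for the agent's edit (e.g. tweaking a hyperparameter)."""
    return train_py + f"\n# experiment {step}"

def research_loop(train_py: str, n_experiments: int = 10):
    best_bpb = run_experiment(train_py)           # baseline run
    log = [("baseline", best_bpb)]
    for step in range(n_experiments):
        candidate = propose_edit(train_py, step)  # agent modifies train.py
        bpb = run_experiment(candidate)           # budgeted training run
        if bpb < best_bpb:                        # lower val_bpb is better
            train_py, best_bpb = candidate, bpb   # keep the change
            log.append(("kept", bpb))
        else:
            log.append(("discarded", bpb))        # revert to previous code
    return train_py, best_bpb, log
```

Because rejected edits are discarded, the best val_bpb can only improve (or stay flat) over the night's run.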
Fixed 5-Minute Time Budget
Every experiment runs for exactly 5 minutes (wall clock). This means:
- Experiments are directly comparable regardless of what the agent changes
- AutoResearch converges on the best model your hardware can produce within that budget
- ~12 experiments/hour, ~100 overnight
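Enforcing a fixed wall-clock budget amounts to a loop guard around the training step. A minimal sketch, assuming a `step_fn` that performs one optimizer step (the actual mechanism in train.py may differ):

```python
import time

TIME_BUDGET_SEC = 5 * 60  # fixed wall-clock budget per experiment

def train_with_budget(step_fn, budget_sec: float = TIME_BUDGET_SEC):
    """Run training steps until the wall-clock budget elapses.

    step_fn() performs one optimizer step and returns its loss.
    Uses a monotonic clock so system clock changes can't skew the budget.
    """
    start = time.monotonic()
    steps, last_loss = 0, None
    while time.monotonic() - start < budget_sec:
        last_loss = step_fn()
        steps += 1
    return steps, last_loss
```

Because the budget is wall-clock rather than step-count, a faster model variant simply gets more optimizer steps, which keeps experiments comparable.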
The Metric: val_bpb
Validation bits per byte — lower is better. Vocab-size-independent, so architectural changes are fairly compared.
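Bits per byte is just the model's cross-entropy loss converted from nats to bits and normalized by the byte length of the underlying text. An illustrative helper (not the repo's exact code):

```python
import math

def val_bpb(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over all predicted tokens)
    to bits per byte of the underlying UTF-8 validation text."""
    return total_nats / math.log(2) / total_bytes

# Example: 1000 tokens at 2.0 nats/token over 4000 bytes of text
# → 1000 * 2.0 / ln(2) / 4000 ≈ 0.721 bpb
```

Dividing by bytes rather than tokens is what makes the metric vocab-size-independent: a bigger vocabulary packs more bytes into each token, and the normalization cancels that out.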
Quick Start
```sh
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train the tokenizer (~2 min)
uv run prepare.py

# 4. Run a single training experiment (~5 min)
uv run train.py
```
Then spin up Claude Code / Codex / any agent and prompt:
Hi have a look at program.md and let's kick off a new experiment!
Design Choices
- Single file to modify — the agent only touches train.py, so diffs are reviewable.
- Fixed time budget — always 5 minutes, so results are comparable.
- Self-contained — no distributed training, no complex configs. One GPU, one file, one metric.
Tuning for Smaller Hardware
For MacBooks and small GPUs, Karpathy recommends:
| Parameter | Default | Small Compute |
|---|---|---|
| Dataset | FineWeb | TinyStories |
| vocab_size | 8192 | 4096 → 256 |
| MAX_SEQ_LEN | Large | Down to 256 |
| DEPTH | 8 | 4 or less |
| WINDOW_PATTERN | "SSSL" | "L" |
| TOTAL_BATCH_SIZE | Large | 2^14 (~16K tokens) |
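The table above translates into a handful of constant overrides near the top of train.py / prepare.py. A hypothetical sketch — the names follow the table, but treat the exact identifiers and defaults as assumptions:

```python
# Hypothetical small-compute overrides (names follow the table above;
# the repo's actual constant names and defaults may differ).
VOCAB_SIZE = 4096        # default 8192; drop toward 256 on tiny machines
MAX_SEQ_LEN = 256        # shorter contexts mean cheaper attention
DEPTH = 4                # default 8; fewer transformer layers
WINDOW_PATTERN = "L"     # default "SSSL"; a single long-attention window
TOTAL_BATCH_SIZE = 2**14 # ~16K tokens per optimizer step
```

Shrinking these knobs keeps each 5-minute experiment meaningful on small hardware: the model still completes enough optimizer steps for val_bpb differences to show up.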
Notable Forks
| Fork | Platform |
|---|---|
| autoresearch-macos | macOS |
| autoresearch-mlx | macOS (MLX) |
| autoresearch-win-rtx | Windows |
AutoResearch vs Alternatives
Category: This is an autonomous AI research agent for LLM training.
| Feature | AutoResearch | nanochat | AISI Inspect | DSPy |
|---|---|---|---|---|
| Focus | Autonomous research agent | LLM training harness | AI eval framework | LM program optimizer |
| Stars | 13.1K ⭐ (3 days!) | 45.3K ⭐ | ~3K ⭐ | ~20K ⭐ |
| Author | Karpathy | Karpathy | UK AISI | Stanford |
| AI Agent Edits Code | ✅ train.py | ❌ Human | ❌ | ❌ |
| Fixed Time Budget | ✅ 5 min | Manual | N/A | N/A |
| Overnight Experiments | ✅ ~100 | ❌ | ❌ | ❌ |
| program.md Skill | ✅ Human writes | ❌ | ❌ | ❌ |
| Single GPU | ✅ | ✅ (or multi) | N/A | N/A |
| Metric | val_bpb | Multiple | Task-specific | Task-specific |
| Files to Edit | 1 (train.py) | Many | Config | Programs |
When to choose AutoResearch: You want AI to do your LLM research overnight — modify code, train, evaluate, iterate autonomously.
When to choose nanochat: You want to manually train and chat with LLMs using a full pipeline.
When to choose AISI Inspect: You need a framework for evaluating AI model capabilities.
When to choose DSPy: You want to automatically optimize LM programs with prompt tuning.
Conclusion
AutoResearch is the beginning of autonomous AI research. One file the agent edits, one file the human programs, 5-minute fixed experiments, ~100 overnight iterations. It's "programming the program" — you write program.md to instruct the AI researcher, not train.py. With 13.1K stars in 3 days and Karpathy's signature minimalism, it represents a paradigm shift: AI agents doing the research you'd do yourself, but while you sleep.
Explore AutoResearch on GitHub
