AutoResearch: The Complete Guide to Karpathy's Autonomous AI Research Agent
AutoResearch lets AI agents run LLM research experiments autonomously overnight. The agent edits train.py, trains for 5 minutes, checks if results improved, keeps or discards, repeats. You wake up to a log of experiments and a better model. By Andrej Karpathy. 13,100+ stars in 3 days.
What Is AutoResearch?
Give an AI agent a small but real LLM training setup and let it experiment autonomously. It modifies code, trains, evaluates, iterates — all night. ~12 experiments/hour, ~100 experiments while you sleep.
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun... That era is long gone. This repo is the story of how it all began." — @karpathy, March 2026
- Language: Python
- License: MIT
- Stars: 13,100+ ⭐ (in 3 days!)
- Forks: 1,704
- Contributors: 5
- Author: Andrej Karpathy
- Companion: nanochat
Only 3 Files
| File | Who Edits | Purpose |
|---|---|---|
| prepare.py | Nobody | Fixed constants, data prep, dataloader, evaluation |
| train.py | The AI Agent | GPT model, optimizer (Muon + AdamW), training loop |
| program.md | The Human | Agent instructions — the "research org code" |
The human writes program.md (the research program). The AI agent modifies train.py (the code). Everything is fair game: architecture, hyperparameters, optimizer, batch size.
How It Works
1. Agent reads program.md
2. Agent modifies train.py
3. Training runs for exactly 5 minutes
4. Metric: val_bpb (validation bits per byte)
5. If improved → keep changes
6. If not → discard changes
7. Repeat (12 experiments/hour, ~100 overnight)
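The keep-or-discard loop above is essentially a hill climb on val_bpb. A minimal sketch follows; `run_experiment` and `propose_edit` are stand-ins for the real 5-minute training run and the agent's code edit, not the repo's actual API.

```python
import random

def run_experiment(train_py: str) -> float:
    """Stand-in for a real 5-minute training run: returns val_bpb
    for the given train.py contents. Here we just simulate noise."""
    random.seed(hash(train_py) % 2**32)
    return 1.0 + random.uniform(-0.1, 0.1)

def propose_edit(train_py: str, step: int) -> str:
    """Stand-in for the agent's edit (e.g. tweaking a hyperparameter)."""
    return train_py + f"\n# experiment {step}"

def research_loop(train_py: str, n_experiments: int = 10):
    best_bpb = run_experiment(train_py)           # baseline run
    log = [("baseline", best_bpb)]
    for step in range(n_experiments):
        candidate = propose_edit(train_py, step)  # agent modifies train.py
        bpb = run_experiment(candidate)           # budgeted training run
        if bpb < best_bpb:                        # lower val_bpb is better
            train_py, best_bpb = candidate, bpb   # keep the change
            log.append(("kept", bpb))
        else:
            log.append(("discarded", bpb))        # revert to previous code
    return train_py, best_bpb, log
```

Because rejected edits are discarded, the best val_bpb can only improve (or stay flat) over the night's run.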
Fixed 5-Minute Time Budget
Every experiment runs for exactly 5 minutes (wall clock). This means:
- Experiments are directly comparable regardless of what the agent changes
- AutoResearch converges on the best model your hardware can produce within that budget
- ~12 experiments/hour, ~100 overnight
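Enforcing a fixed wall-clock budget amounts to a loop guard around the training step. A minimal sketch, assuming a `step_fn` that performs one optimizer step (the actual mechanism in train.py may differ):

```python
import time

TIME_BUDGET_SEC = 5 * 60  # fixed wall-clock budget per experiment

def train_with_budget(step_fn, budget_sec: float = TIME_BUDGET_SEC):
    """Run training steps until the wall-clock budget elapses.

    step_fn() performs one optimizer step and returns its loss.
    Uses a monotonic clock so system clock changes can't skew the budget.
    """
    start = time.monotonic()
    steps, last_loss = 0, None
    while time.monotonic() - start < budget_sec:
        last_loss = step_fn()
        steps += 1
    return steps, last_loss
```

Because the budget is wall-clock rather than step-count, a faster model variant simply gets more optimizer steps, which keeps experiments comparable.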
The Metric: val_bpb
Validation bits per byte — lower is better. Vocab-size-independent, so architectural changes are fairly compared.
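Bits per byte is just the model's cross-entropy loss converted from nats to bits and normalized by the byte length of the underlying text. An illustrative helper (not the repo's exact code):

```python
import math

def val_bpb(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over all predicted tokens)
    to bits per byte of the underlying UTF-8 validation text."""
    return total_nats / math.log(2) / total_bytes

# Example: 1000 tokens at 2.0 nats/token over 4000 bytes of text
# → 1000 * 2.0 / ln(2) / 4000 ≈ 0.721 bpb
```

Dividing by bytes rather than tokens is what makes the metric vocab-size-independent: a bigger vocabulary packs more bytes into each token, and the normalization cancels that out.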
Quick Start
```sh
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train the tokenizer (~2 min)
uv run prepare.py

# 4. Run a single training experiment (~5 min)
uv run train.py
```
Then spin up Claude Code / Codex / any agent and prompt:
Hi have a look at program.md and let's kick off a new experiment!
Design Choices
- Single file to modify — the agent only touches train.py, so diffs are reviewable.
- Fixed time budget — always 5 minutes, so results are comparable.
- Self-contained — no distributed training, no complex configs. One GPU, one file, one metric.
Tuning for Smaller Hardware
For MacBooks and small GPUs, Karpathy recommends:
| Parameter | Default | Small Compute |
|---|---|---|
| Dataset | FineWeb | TinyStories |
| vocab_size | 8192 | 4096 → 256 |
| MAX_SEQ_LEN | Large | Down to 256 |
| DEPTH | 8 | 4 or less |
| WINDOW_PATTERN | "SSSL" | "L" |
| TOTAL_BATCH_SIZE | Large | 2^14 (~16K tokens) |
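The table above translates into a handful of constant overrides near the top of train.py / prepare.py. A hypothetical sketch — the names follow the table, but treat the exact identifiers and defaults as assumptions:

```python
# Hypothetical small-compute overrides (names follow the table above;
# the repo's actual constant names and defaults may differ).
VOCAB_SIZE = 4096        # default 8192; drop toward 256 on tiny machines
MAX_SEQ_LEN = 256        # shorter contexts mean cheaper attention
DEPTH = 4                # default 8; fewer transformer layers
WINDOW_PATTERN = "L"     # default "SSSL"; a single long-attention window
TOTAL_BATCH_SIZE = 2**14 # ~16K tokens per optimizer step
```

Shrinking these knobs keeps each 5-minute experiment meaningful on small hardware: the model still completes enough optimizer steps for val_bpb differences to show up.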
Notable Forks
| Fork | Platform |
|---|---|
| autoresearch-macos | macOS |
| autoresearch-mlx | macOS (MLX) |
| autoresearch-win-rtx | Windows |
AutoResearch vs Alternatives
Category: This is an autonomous AI research agent for LLM training.
| Feature | AutoResearch | nanochat | AISI Inspect | DSPy |
|---|---|---|---|---|
| Focus | Autonomous research agent | LLM training harness | AI eval framework | LM program optimizer |
| Stars | 13.1K ⭐ (3 days!) | 45.3K ⭐ | ~3K ⭐ | ~20K ⭐ |
| Author | Karpathy | Karpathy | UK AISI | Stanford |
| AI Agent Edits Code | ✅ train.py | ❌ Human | ❌ | ❌ |
| Fixed Time Budget | ✅ 5 min | Manual | N/A | N/A |
| Overnight Experiments | ✅ ~100 | ❌ | ❌ | ❌ |
| program.md Skill | ✅ Human writes | ❌ | ❌ | ❌ |
| Single GPU | ✅ | ✅ (or multi) | N/A | N/A |
| Metric | val_bpb | Multiple | Task-specific | Task-specific |
| Files to Edit | 1 (train.py) | Many | Config | Programs |
When to choose AutoResearch: You want AI to do your LLM research overnight — modify code, train, evaluate, iterate autonomously.
When to choose nanochat: You want to manually train and chat with LLMs using a full pipeline.
When to choose AISI Inspect: You need a framework for evaluating AI model capabilities.
When to choose DSPy: You want to automatically optimize LM programs with prompt tuning.
Conclusion
AutoResearch is the beginning of autonomous AI research. One file the agent edits, one file the human programs, 5-minute fixed experiments, ~100 overnight iterations. It's "programming the program" — you write program.md to instruct the AI researcher, not train.py. With 13.1K stars in 3 days and Karpathy's signature minimalism, it represents a paradigm shift: AI agents doing the research you'd do yourself, but while you sleep.
Explore AutoResearch on GitHub
