13 Commits

Author SHA1 Message Date
Andrej 32a1460f62 Merge pull request #301 from indianspeedster/master
add AMD ROCm fork to notable forks section
2026-03-16 11:43:26 -07:00
indianspeedster 513fe6fcee add AMD ROCm fork to notable forks section 2026-03-16 11:28:48 -07:00
Andrej c2450add72 Guard against infinite loop when no training shards exist, fix README typo 2026-03-10 22:32:17 -07:00
Andrej 0be1e4fdf9 fix NaN loss not caught by fast-fail check 2026-03-10 22:31:43 -07:00
Contributor ebf357841b fix(train): make NaN fast-fail check explicit 2026-03-11 04:28:08 +00:00
Hugh Brown 09ebea439d Guard against infinite loop when no training shards exist, fix README typo
Add assertion after filtering val_path from parquet_paths for the "train"
split so an empty list fails fast instead of spinning in a silent infinite
loop. Also remove stray article "a" in README ("a three files" → "three
files").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 21:34:40 -06:00
Andrej c12eef778e Include beginner's guide to neural networks
Added a resource link for beginners in neural networks.
2026-03-09 16:00:55 -07:00
haosenwang1018 b5ba8ac00d fix NaN loss not caught by fast-fail check
`train_loss_f > 100` silently passes on NaN because IEEE 754 NaN
comparisons always return False. When an agent experiment produces
NaN (e.g. from an aggressive LR change), the run wastes the full
5-minute budget instead of failing fast.

`not (x <= 100)` catches both >100 and NaN with no added complexity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 23:51:02 +08:00
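The IEEE 754 behavior this commit describes is easy to verify in a standalone Python snippet (not from the repo):

```python
nan = float("nan")

# Under IEEE 754, every ordered comparison involving NaN is False,
# so a plain `> 100` explosion check silently passes on NaN.
print(nan > 100)         # False: NaN slips through the old check
print(not (nan <= 100))  # True: the negated form catches it

def loss_exploded(loss: float) -> bool:
    # `not (loss <= 100)` is True both for loss > 100 and for NaN
    return not (loss <= 100)

print(loss_exploded(0.5))           # False: healthy loss
print(loss_exploded(101.0))         # True: exploding loss
print(loss_exploded(float("nan")))  # True: NaN now fails fast
```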
Andrej Karpathy 068d93da75 clarify that results.tsv should not be committed, leave untracked 2026-03-09 05:11:07 +00:00
Andrej Karpathy c92bee55eb some docs on what to play with to make autoresearch better on smaller computers 2026-03-09 04:49:15 +00:00
Andrej Karpathy 2224cd7cae reshuffle readme a bit and link to tiny stories for apple silicon guidance 2026-03-08 23:25:53 +00:00
Andrej Karpathy f16ece488f clarification to baseline run instruction, there was some language from a previous version that wasn't fully cleaned up 2026-03-08 23:16:30 +00:00
Andrej Karpathy 9264224a3c add notable fork mlx 2026-03-08 17:06:29 +00:00
5 changed files with 32 additions and 8 deletions
+3
@@ -18,3 +18,6 @@ AGENTS.md
# Experimental code/artifacts
dev/
+# Results file
+results.tsv
+23 -4
@@ -8,7 +8,7 @@ The idea: give an AI agent a small but real LLM training setup and let it experi
## How it works
-The repo is deliberately kept small and only really has a three files that matter:
+The repo is deliberately kept small and only really has three files that matter:
- **`prepare.py`** — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
- **`train.py`** — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. **This file is edited and iterated on by the agent**.
@@ -16,6 +16,8 @@ The repo is deliberately kept small and only really has a three files that matte
By design, training runs for a **fixed 5-minute time budget** (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is **val_bpb** (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
+If you are new to neural networks, this ["Dummy's Guide"](https://x.com/hooeem/status/2030720614752039185) looks pretty good for a lot more context.
## Quick start
**Requirements:** A single NVIDIA GPU (tested on H100), Python 3.10+, [uv](https://docs.astral.sh/uv/).
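For context on the metric: val_bpb converts the model's cross-entropy into bits and normalizes by raw byte count rather than token count, which is what makes it vocab-size-independent. A back-of-the-envelope sketch (the function and argument names here are illustrative, not the repo's actual accounting):

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean cross-entropy (nats/token) into bits per byte.

    Dividing by the byte count rather than the token count is what makes
    the metric tokenizer-independent: a tokenizer with bigger tokens must
    pay correspondingly more bits per token just to break even.
    """
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# e.g. a loss of ln(2) nats/token on data averaging 4 bytes/token:
print(bits_per_byte(math.log(2), n_tokens=1000, n_bytes=4000))  # 0.25
```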
@@ -37,8 +39,6 @@ uv run train.py
If the above commands all work ok, your setup is working and you can go into autonomous research mode.
-**Platforms support**. This code currently requires that you have a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS and other platforms but this would also bloat the code. I'm not 100% sure that I want to take this on personally right now. The code is just a demonstration and I don't know how much I'll support it going forward. People can reference (or have their agents reference) the full/parent nanochat repository that has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernels fallback implementation, generic device support, autodetection, etc.), feel free to create forks or discussions for other platforms and I'm happy to link to them here in the README in some new notable forks section or etc.
## Running the agent
Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like:
@@ -64,9 +64,28 @@ pyproject.toml — dependencies
- **Fixed time budget.** Training always runs for exactly 5 minutes, regardless of your specific platform. This means you can expect approx 12 experiments/hour and approx 100 experiments while you sleep. There are two upsides to this design decision. First, it makes experiments directly comparable regardless of what the agent changes (model size, batch size, architecture, etc.). Second, it means that autoresearch will find the best model for your platform within that time budget. The downside is that your runs (and results) are not comparable to those of people on other compute platforms.
- **Self-contained.** No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.
+## Platform support
+This code currently requires a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS and other platforms, but this would also bloat the code. I'm not 100% sure that I want to take this on personally right now. People can reference (or have their agents reference) the full/parent nanochat repository, which has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernel fallback implementation, generic device support, autodetection, etc.). Feel free to create forks or discussions for other platforms, and I'm happy to link to them in the notable forks section below.
+Seeing as there seems to be a lot of interest in tinkering with autoresearch on much smaller compute platforms than an H100, a few extra words. If you're going to run autoresearch on smaller computers (MacBooks etc.), I'd recommend one of the forks below. On top of that, here are some recommendations for how to tune the defaults down to much smaller models, for aspiring forks:
+1. To get half-decent results, use a dataset with much less entropy, e.g. this [TinyStories dataset](https://huggingface.co/datasets/karpathy/tinystories-gpt4-clean) of GPT-4-generated short stories. Because the data is much narrower in scope, you will see reasonable results with much smaller models (if you try to sample from them after training).
+2. You might experiment with decreasing `vocab_size`, e.g. from 8192 down to 4096, 2048, 1024, or even a plain byte-level tokenizer with 256 possible bytes after UTF-8 encoding.
+3. In `prepare.py`, lower `MAX_SEQ_LEN` a lot, depending on the computer even down to 256 or so. As you lower `MAX_SEQ_LEN`, you may want to increase `DEVICE_BATCH_SIZE` in `train.py` slightly to compensate; the number of tokens per fwd/bwd pass is the product of the two.
+4. Also in `prepare.py`, decrease `EVAL_TOKENS` so that your validation loss is evaluated on much less data.
+5. In `train.py`, the primary knob that controls model complexity is `DEPTH` (default 8). Many other variables are simply functions of it, so e.g. lower it down to 4.
+6. You'll most likely want a `WINDOW_PATTERN` of just "L", because "SSSL" uses an alternating banded attention pattern that may be very inefficient on your hardware. Try it.
+7. Lower `TOTAL_BATCH_SIZE` a lot, but keep it a power of 2, e.g. down to `2**14` (~16K) or so; hard to tell.
+I think these are the reasonable hyperparameters to play with. Ask your favorite coding agent for help, and copy-paste it this guide along with the full source code.
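The recommendations above could be collected into a small-compute config along these lines (a sketch only: the constant names are the ones described above, but every value here is an illustrative guess, not a tested setting):

```python
# Hypothetical small-compute overrides (values are illustrative only)
# prepare.py
VOCAB_SIZE = 2048         # down from 8192; 256 would be a byte-level tokenizer
MAX_SEQ_LEN = 256         # lowered a lot from the H100 default
EVAL_TOKENS = 2**16       # evaluate val loss on much less data
# train.py
DEPTH = 4                 # primary complexity knob, default 8
WINDOW_PATTERN = "L"      # avoid the banded "SSSL" pattern on weak hardware
DEVICE_BATCH_SIZE = 32    # bumped slightly to compensate for short sequences
TOTAL_BATCH_SIZE = 2**14  # lowered a lot, kept a power of 2

# tokens per fwd/bwd pass is the product of the two:
tokens_per_pass = DEVICE_BATCH_SIZE * MAX_SEQ_LEN
print(tokens_per_pass)  # 8192
# which implies this many gradient-accumulation steps per optimizer step:
print(TOTAL_BATCH_SIZE // tokens_per_pass)  # 2
```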
## Notable forks
-- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos)
+- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) (MacOS)
+- [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx) (MacOS)
+- [jsegov/autoresearch-win-rtx](https://github.com/jsegov/autoresearch-win-rtx) (Windows)
+- [andyluo7/autoresearch](https://github.com/andyluo7/autoresearch) (AMD)
## License
+1
@@ -258,6 +258,7 @@ def _document_batches(split, tokenizer_batch_size=128):
    val_path = os.path.join(DATA_DIR, VAL_FILENAME)
    if split == "train":
        parquet_paths = [p for p in parquet_paths if p != val_path]
+        assert len(parquet_paths) > 0, "No training shards found."
    else:
        parquet_paths = [val_path]
    epoch = 1
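The failure mode this assertion guards against is easy to reproduce: an infinite dataloader cycling over an empty shard list never yields a batch and never raises, so the run spins silently. A minimal standalone sketch of the same filter-then-assert pattern (function name hypothetical):

```python
def select_shards(parquet_paths, val_path, split):
    # Mirrors the split filtering shown in the hunk above
    if split == "train":
        paths = [p for p in parquet_paths if p != val_path]
        # Without this, an empty list would make any loop that cycles
        # over the shards spin forever without producing a batch.
        assert len(paths) > 0, "No training shards found."
    else:
        paths = [val_path]
    return paths

print(select_shards(["s0.parquet", "val.parquet"], "val.parquet", "train"))
# ['s0.parquet']
try:
    select_shards(["val.parquet"], "val.parquet", "train")
except AssertionError as e:
    print(e)  # No training shards found.
```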
+2 -2
@@ -13,7 +13,7 @@ To set up a new experiment, work with the user to:
- `prepare.py` — fixed constants, data prep, tokenizer, dataloader, evaluation. Do not modify.
- `train.py` — the file you modify. Model architecture, optimizer, training loop.
4. **Verify data exists**: Check that `~/.cache/autoresearch/` contains data shards and a tokenizer. If not, tell the human to run `uv run prepare.py`.
-5. **Initialize results.tsv**: Create `results.tsv` with header row and baseline entry. The baseline results are already known from the output format section below (val_bpb: 0.997900, peak_vram_mb: 45060.2). Do NOT re-run the baseline — just record it.
+5. **Initialize results.tsv**: Create `results.tsv` with just the header row. The baseline will be recorded after the first run.
6. **Confirm and go**: Confirm setup looks good.
Once you get confirmation, kick off the experimentation.
@@ -99,7 +99,7 @@ LOOP FOREVER:
4. Run the experiment: `uv run train.py > run.log 2>&1` (redirect everything — do NOT use tee or let output flood your context)
5. Read out the results: `grep "^val_bpb:\|^peak_vram_mb:" run.log`
6. If the grep output is empty, the run crashed. Run `tail -n 50 run.log` to read the Python stack trace and attempt a fix. If you can't get things to work after more than a few attempts, give up.
-7. Record the results in the tsv
+7. Record the results in the tsv (NOTE: do not commit the results.tsv file, leave it untracked by git)
8. If val_bpb improved (lower), you "advance" the branch, keeping the git commit
9. If val_bpb is equal or worse, you git reset back to where you started
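The grep step in the loop above can also be mirrored in Python if the agent would rather parse `run.log` programmatically. A sketch, with the log format assumed from the grep pattern (function name hypothetical):

```python
import re

def read_metrics(log_text: str) -> dict:
    # Equivalent to: grep "^val_bpb:\|^peak_vram_mb:" run.log
    metrics = {}
    for line in log_text.splitlines():
        m = re.match(r"^(val_bpb|peak_vram_mb):\s*([\d.]+)\s*$", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics  # empty dict means the run crashed; read the log tail

log = "step 100 loss 1.23\nval_bpb: 0.997900\npeak_vram_mb: 45060.2\n"
print(read_metrics(log))  # {'val_bpb': 0.9979, 'peak_vram_mb': 45060.2}
```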
+3 -2
@@ -9,6 +9,7 @@ os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
import gc
+import math
import time
from dataclasses import dataclass, asdict
@@ -565,8 +566,8 @@ while True:
    train_loss_f = train_loss.item()
-    # Fast fail: abort if loss is exploding
-    if train_loss_f > 100:
+    # Fast fail: abort if loss is exploding or NaN
+    if math.isnan(train_loss_f) or train_loss_f > 100:
        print("FAIL")
        exit(1)