| Author | Commit | Message | Date |
|---|---|---|---|
| autoresearch | 97dda8577f | init scale 0.7 to 0.68 | 2026-03-08 12:44:03 +00:00 |
| autoresearch | f5979a7464 | init scale 0.7x | 2026-03-08 10:14:06 +00:00 |
| autoresearch | 41d50a8539 | reduce transformer init scale by 0.8x | 2026-03-08 10:02:16 +00:00 |
| autoresearch | 264a05b48f | add small WD 0.01 to lm_head (AdamW) | 2026-03-08 07:23:33 +00:00 |
| autoresearch | 7f63c17076 | unembedding LR 0.006 to 0.005 | 2026-03-08 07:05:48 +00:00 |
| autoresearch | a7aa309dd4 | muon momentum warmup 300 to 200 steps | 2026-03-08 05:29:56 +00:00 |
| autoresearch | aa8f408f39 | unembedding LR 0.004 to 0.006 | 2026-03-08 04:48:14 +00:00 |
| autoresearch | 772dada6cc | FINAL_LR_FRAC 0.0 to 0.05 (small LR floor) | 2026-03-08 04:36:38 +00:00 |
| autoresearch | 0640555a1f | x0_lambda init 0.1 to 0.05 | 2026-03-08 04:30:50 +00:00 |
| autoresearch | 7d047e42f4 | embedding LR 0.6 to 0.8 | 2026-03-08 04:19:15 +00:00 |
| autoresearch | 59e9dd9aab | RoPE base frequency 10K to 200K | 2026-03-08 04:13:30 +00:00 |
| autoresearch | 7da0b673a1 | short window 1/8 context (256 tokens instead of 1024) | 2026-03-08 04:07:44 +00:00 |
| autoresearch | 8363d52e8d | SSSSL window pattern (5:1 short:long ratio) | 2026-03-08 04:01:58 +00:00 |
| autoresearch | 4e6697f68d | warmdown 0.5 to 0.7 (more cooldown) | 2026-03-08 03:56:11 +00:00 |
| autoresearch | 7f2a65c9a5 | depth 9 aspect_ratio 57 (extra layer, dim ~512) | 2026-03-08 03:44:34 +00:00 |
| autoresearch | bea057bc08 | halve batch size 524K to 262K for more steps in 5 min | 2026-03-08 03:38:47 +00:00 |
| Andrej Karpathy | 8a5c4869bd | bunch of small changes to docs and files, and a teaser figure with a blooper :) | 2026-03-07 19:00:04 +00:00 |
| Marcin Bogdanski | 17b480aa65 | add fallback FA3 kernel for non-Hopper GPUs | 2026-03-07 01:31:48 +00:00 |
| Andrej Karpathy | b11d6f283f | initial commit | 2026-03-06 21:58:52 +00:00 |