New medium track WR: combined #129 and #128 #137
Update Smoothing + Snoo
Combining #128 and #129 decreases iters to 5590 (-20 from #129). The p-values are also much more robust: one could likely decrease by another 10-20 iters, but it would start requiring more data to reach high confidence.
#129 smooths out the Muon updates:
Here, unlike in #129, we use a constant EMA coefficient of 0.2.
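As a rough sketch of what the smoothing amounts to (assuming the EMA buffer sits in the per-parameter optimizer state and that the 0.2 coefficient weights the fresh orthogonalized update; the exact placement inside #129's Muon code may differ):

```python
import torch

EMA_COEF = 0.2  # constant coefficient used here (vs. the schedule in #129)

def smoothed_update(state: dict, update: torch.Tensor, coef: float = EMA_COEF) -> torch.Tensor:
    """EMA-smooth the (orthogonalized) Muon update for one parameter.
    `state` stands in for the per-parameter optimizer state; names are illustrative."""
    if "ema_update" not in state:
        state["ema_update"] = update.clone()
    else:
        # new_ema = (1 - coef) * old_ema + coef * update  (assumed convention for the coefficient)
        state["ema_update"].lerp_(update, coef)
    return state["ema_update"]
```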
#128 applies a lookahead step to the updates: run an inner optimizer for K iterations, and treat the parameter displacement as a “gradient” for an outer SGD optimizer. Note that if K=1 and the SGD optimizer does not employ Nesterov momentum, then I think the two are equivalent, except that #128 works on every parameter rather than just the Muon parameters.
Here, we simply use the smoothed Muon updates of #129 as the inner optimizer for #128, in addition to importing some learning rate tuning from #129.
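A minimal sketch of the lookahead structure described above, with the smoothed Muon step standing in as the inner optimizer; the function name, hyperparameters, and the closure interface are illustrative, not the actual configuration from #128:

```python
import torch

def lookahead_step(params, inner_opt, closure, K=5, outer_lr=1.0,
                   outer_momentum=0.0, outer_bufs=None):
    """One lookahead round: K inner steps, then an outer SGD step on the
    resulting displacement. All names/hyperparameters here are illustrative."""
    slow = [p.detach().clone() for p in params]   # snapshot of the "slow" weights
    for _ in range(K):                            # inner optimizer (e.g. smoothed Muon)
        inner_opt.zero_grad()
        closure().backward()
        inner_opt.step()
    if outer_bufs is None:
        outer_bufs = [torch.zeros_like(p) for p in params]
    with torch.no_grad():
        for p, s, buf in zip(params, slow, outer_bufs):
            g = s - p                             # displacement, used as a "gradient"
            buf.mul_(outer_momentum).add_(g)      # plain SGD momentum (no Nesterov)
            p.copy_(s - outer_lr * buf)           # outer SGD step from the slow weights
    return outer_bufs
```

With K=1, an outer learning rate of 1, and no momentum, the outer step simply copies the inner result back, matching the equivalence noted above.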
Overall, the total iterations can be decreased to 5590 (from 5610 in #129 or 5640 in #128). I was also more stringent with the p-value criterion, so it’s likely there is a bit more “slack” in this submission than in either #128 or #129.
Baseline Stats (80 runs each)
There is substantial variance in the p-values for these runs, so I ran 80 runs of each baseline, and then created 1000 bootstrap samples of size 40 to estimate the fraction of times the p-value was less than 0.01 if you were to only do 40 runs.
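A sketch of this bootstrap procedure, assuming the acceptance criterion is a one-sided t-test of the run's validation losses against the track's target loss (`TARGET` and the synthetic losses below are placeholders, not the repo's actual values or my measured data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TARGET = 2.92  # placeholder for the track's target validation loss
# synthetic stand-in for the 80 measured validation losses
val_losses = rng.normal(loc=2.916, scale=0.004, size=80)

def pval(sample, target=TARGET):
    # one-sided t-test that the mean validation loss is below the target
    return stats.ttest_1samp(sample, target, alternative="less").pvalue

boot_pvals = np.array([
    pval(rng.choice(val_losses, size=40, replace=True))
    for _ in range(1000)
])
print("fraction of bootstrap samples with p < 0.01:", (boot_pvals < 0.01).mean())
print("mean bootstrap p-value:", boot_pvals.mean())
```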
#129:
#128 (here I use the current configuration with 5640 steps):
So, from this we see that both of these runs have a reasonable chance of hitting the required p-value in 40 samples. The “mean p-value” for the bootstrap analysis is very high because the mean is disproportionately influenced by the larger values.
This PR
I ran 160 runs for the new changes in order to have more data for the bootstrap estimate, and from these created 1000 bootstrapped samples of size 40 each to get an idea of the variance in the p-value calculation. Over these samples, we see:
So there seems to be a much higher chance of getting a low p-value with 40 runs.
See the attached readme for the full list of 160 validation losses and p-values. I also ran some ablations, provided in the readme:
I also tried tuning the cooldown fraction as suggested by @YouJiacheng in a comment on #129 (both with and without smoothing), but didn't see any gains. It's possible that using a more delicate learning rate schedule directly derived from the EMA might be better though; I didn't check this.