
@acutkosky

Update Smoothing + Snoo

Combining #128 and #129, decreases iters to 5590 (-20 from #129). Also, the p-values are much more robust - one could likely decrease by 10-20 more iters, but it would start requiring more data to get high confidence.

#129 smooths out the Muon updates:

muon_update = NS(EMA(grads))
final_update = EMA(muon_update)

Here, unlike in #129, we use a constant EMA coefficient of 0.2.
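The double-EMA structure above can be sketched as follows. This is an illustrative stand-in, not the PR's code: `ns()` uses an SVD in place of Muon's actual Newton-Schulz iteration, and which side the 0.2 coefficient weights is an assumption.

```python
import numpy as np

def ema(prev, new, coef=0.2):
    # Constant-coefficient EMA; here coef weights the new value
    # (which side the 0.2 weights is an assumption).
    return (1.0 - coef) * prev + coef * new

def ns(update):
    # Stand-in for Muon's Newton-Schulz orthogonalization: return the
    # nearest (semi-)orthogonal matrix. The real method iterates a matrix
    # polynomial instead of calling an SVD.
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
grad_ema = np.zeros((4, 4))
update_ema = np.zeros((4, 4))
for _ in range(3):
    grad = rng.standard_normal((4, 4))         # pretend minibatch gradient
    grad_ema = ema(grad_ema, grad)             # EMA(grads)
    muon_update = ns(grad_ema)                 # muon_update = NS(EMA(grads))
    update_ema = ema(update_ema, muon_update)  # final_update = EMA(muon_update)
```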

#128 applies a lookahead step to the updates: run an inner optimizer for K iterations, then treat the resulting parameter displacement as a “gradient” for an outer SGD optimizer. Note that if K=1 and the outer SGD optimizer does not employ Nesterov momentum, then I think the two approaches are equivalent - with the exception that #128 works on every parameter rather than just the Muon parameters.
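A minimal sketch of that lookahead step, under the description above (the function names and the `inner_step` interface are hypothetical, not the PR's actual API):

```python
def lookahead_step(params, inner_step, k, outer_lr):
    # Run the inner optimizer for k iterations from the current (slow) weights.
    fast = list(params)
    for _ in range(k):
        fast = inner_step(fast)
    # Treat the displacement (slow - fast) as a "gradient" for an SGD step on
    # the slow weights. With outer_lr=1.0 this just adopts the k-step result.
    return [p - outer_lr * (p - f) for p, f in zip(params, fast)]

# Toy inner optimizer that halves every parameter.
halve = lambda ps: [0.5 * p for p in ps]
new_params = lookahead_step([1.0], halve, k=2, outer_lr=0.5)
# fast = 0.25 after two halvings; slow update: 1.0 - 0.5 * (1.0 - 0.25) = 0.625
```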

Here, we simply use the smoothed Muon updates of #129 as the inner optimizer for #128, in addition to importing some learning rate tuning from #129.

Overall, the total iterations can be decreased to 5590 (from 5610 in #129 or 5640 in #128). I was also more stringent with the p-value criterion, so it’s likely there is a bit more “slack” in this submission than in either #128 or #129.

Baseline Stats (80 runs each)

There is substantial variance in the p-values for these runs, so I ran 80 runs of each baseline, and then created 1000 bootstrap samples of size 40 to estimate the fraction of times the p-value would be less than 0.01 if you were to do only 40 runs.

#129:

--- Val Loss Stats ---
mean: 	2.919815
std:  	0.000751
val_loss t-test p=0.015461 (small means <2.92)
--- Bootstrap p-value analysis --- (1000 samples of size 40)
Percentage of p-values below 0.01: 21.00%
--- Training Time Stats ---
train time (minutes): mean=23.4811, std=0.1983

#128 (here I use the current configuration with 5640 steps):

--- Val Loss Stats ---
mean: 	2.919738
std:  	0.000884
val_loss t-test p=0.004818 (small means <2.92)

--- Bootstrap p-value analysis --- (1000 samples of size 40)
Percentage of p-values below 0.01: 32.10%

--- Training Time Stats ---
train time 99% confidence interval: (23.5856 - 23.6986)

So, from this we see that both of these runs have a reasonable chance of hitting the required p-value in 40 samples. The “mean p-value” over the bootstrap samples is very high because the p-value distribution is heavily right-skewed, so the mean is dominated by the occasional large values.

This PR

I ran 160 runs for the new changes in order to have more data for the bootstrap estimate, and from these created 1000 bootstrapped samples of size 40 each to get an idea of the variance in the p-value calculation. Over these samples, we see:

--- Val Loss stats over all 160 runs --- 
mean: 	2.919547
std:  	0.000798
val loss 99% confidence interval: (2.919383 - 2.919712)
val_loss t-test p=0.000000 (small means <2.92)

--- Bootstrap p-value analysis (1000 samples of size 40 each) ---
Percentage of p-values below 0.01: 85.40%

--- Training Time Stats ---
train time (minutes): mean=23.4283, std=0.1866

So there seems to be a much higher chance of getting a low p-value with 40 runs.

See the attached readme for the full list of 160 validation losses and p-values. I also ran some ablations, provided in the readme:

  • Decrease iters to 5580: this one still hits the baseline, but with only a ~60% chance of a 40-run sample having a p-value < 0.01.
  • Remove the EMA on Muon, but keep the better learning rate tuning from #129, and increase iters to 5600 to make up for the faster iterations: this doesn't hit the target.

I also tried tuning the cooldown fraction as suggested by @YouJiacheng in a comment on #129 (both with and without smoothing), but didn't see any gains. It's possible that using a more delicate learning rate schedule directly derived from the EMA might be better though; I didn't check this.

@ClassicLarry
Collaborator

Appreciate the effort to make the p-value more accurate!

The short track tends to sit between 0.001 and 0.002 loss below the threshold, which is done intentionally so people can focus on tangible improvements instead of p-hacking. If I'm running a test and I can't get p<0.01 in under 10 runs, I'll typically throw out the idea. The medium track might benefit from continuing the direction of this PR and increasing the slack further.

snimu added a commit to snimu/modded-nanogpt that referenced this pull request Oct 4, 2025