New medium track WR: combined #129 and #128 #137
Update Smoothing + Snoo
Combining #128 and #129 decreases iters to 5590 (-20 from #129). The p-values are also much more robust: one could likely decrease by another 10-20 iters, but it would start requiring more data to reach high confidence.
#129 smooths out the Muon updates:
Here, unlike in #129, we use a constant EMA coefficient of 0.2.
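As a rough sketch of what the smoothing amounts to (assuming the EMA buffer sits in the per-parameter optimizer state and that the 0.2 coefficient weights the fresh orthogonalized update; the exact placement inside #129's Muon code may differ):

```python
import torch

EMA_COEF = 0.2  # constant coefficient used here (vs. the schedule in #129)

def smoothed_update(state: dict, update: torch.Tensor, coef: float = EMA_COEF) -> torch.Tensor:
    """EMA-smooth the (orthogonalized) Muon update for one parameter.
    `state` stands in for the per-parameter optimizer state; names are illustrative."""
    if "ema_update" not in state:
        state["ema_update"] = update.clone()
    else:
        # new_ema = (1 - coef) * old_ema + coef * update  (assumed convention for the coefficient)
        state["ema_update"].lerp_(update, coef)
    return state["ema_update"]
```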
#128 applies a lookahead step to the updates: run an inner optimizer for K iterations, and treat the parameter displacement as a “gradient” for an outer SGD optimizer. Note that if K=1 and the SGD optimizer does not employ Nesterov momentum, then I think the two are equivalent, except that #128 works on every parameter rather than just the Muon parameters.
Here, we simply use the smoothed Muon updates of #129 as the inner optimizer for #128, in addition to importing some learning rate tuning from #129.
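A minimal sketch of the lookahead structure described above, with the smoothed Muon step standing in as the inner optimizer; the function name, hyperparameters, and the closure interface are illustrative, not the actual configuration from #128:

```python
import torch

def lookahead_step(params, inner_opt, closure, K=5, outer_lr=1.0,
                   outer_momentum=0.0, outer_bufs=None):
    """One lookahead round: K inner steps, then an outer SGD step on the
    resulting displacement. All names/hyperparameters here are illustrative."""
    slow = [p.detach().clone() for p in params]   # snapshot of the "slow" weights
    for _ in range(K):                            # inner optimizer (e.g. smoothed Muon)
        inner_opt.zero_grad()
        closure().backward()
        inner_opt.step()
    if outer_bufs is None:
        outer_bufs = [torch.zeros_like(p) for p in params]
    with torch.no_grad():
        for p, s, buf in zip(params, slow, outer_bufs):
            g = s - p                             # displacement, used as a "gradient"
            buf.mul_(outer_momentum).add_(g)      # plain SGD momentum (no Nesterov)
            p.copy_(s - outer_lr * buf)           # outer SGD step from the slow weights
    return outer_bufs
```

With K=1, an outer learning rate of 1, and no momentum, the outer step simply copies the inner result back, matching the equivalence noted above.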
Overall, the total iterations can be decreased to 5590 (from 5610 in #129 or 5640 in #128). I was also more stringent with the p-value criterion, so it’s likely there is a bit more “slack” in this submission than in either #128 or #129.
Baseline Stats (80 runs each)
There is substantial variance in the p-values for these runs, so I ran 80 runs of each baseline, and then created 1000 bootstrap samples of size 40 to estimate the fraction of times the p-value was less than 0.01 if you were to only do 40 runs.
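A sketch of this bootstrap procedure, assuming the acceptance criterion is a one-sided t-test of the run's validation losses against the track's target loss (`TARGET` and the synthetic losses below are placeholders, not the repo's actual values or my measured data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TARGET = 2.92  # placeholder for the track's target validation loss
# synthetic stand-in for the 80 measured validation losses
val_losses = rng.normal(loc=2.916, scale=0.004, size=80)

def pval(sample, target=TARGET):
    # one-sided t-test that the mean validation loss is below the target
    return stats.ttest_1samp(sample, target, alternative="less").pvalue

boot_pvals = np.array([
    pval(rng.choice(val_losses, size=40, replace=True))
    for _ in range(1000)
])
print("fraction of bootstrap samples with p < 0.01:", (boot_pvals < 0.01).mean())
print("mean bootstrap p-value:", boot_pvals.mean())
```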
#129:
#128 (here I use the current configuration with 5640 steps):
So, from this we see that both of these runs have a reasonable chance of hitting the required p-value in 40 samples. The “mean p-value” for the bootstrap analysis is very high because the mean is disproportionately influenced by the larger values.
This PR
I ran 160 runs for the new changes in order to have more data for the bootstrap estimate, and from these created 1000 bootstrapped samples of size 40 each to get an idea of the variance in the p-value calculation. Over these samples, we see:
So there seems to be a much higher chance of getting a low p-value with 40 runs.
See the attached readme for the full list of 160 validation losses and p-values. I also ran some ablations, provided in the readme:
I also tried tuning the cooldown fraction as suggested by @YouJiacheng in a comment on #129 (both with and without smoothing), but didn't see any gains. It's possible that using a more delicate learning rate schedule directly derived from the EMA might be better though; I didn't check this.