New record 146.8s 09/30/25 (-0.6s): CustomBatching, only update Adam Params every other step #136
Conversation
Merge PR 118: …aining, improve skip connection gating, and enhance bfloat16 usage
You could easily decrease steps by 15-20 and cut off a whole second while still being under 3.28 loss!
I am done testing for a while, but yes, the next couple of seconds should fall pretty quickly.
@ClassicLarry I believe this record also introduced the cooldown phase for Muon. What was the motivation for it in this PR?
I don't know how accurate this is, but I have a mental image of the model during training moving down the side of a snowy mountain, with the final low-lr regime being the model committing to a local crevasse. When taking really small steps in this crevasse, I don't want to be influenced by old gradients; I want the agility to move with the angle of the crevasse. That was the theoretical motivation. The second factor is that end-of-training optimizations are really easy to tune, because you can save a couple of model runs right near the end, load them into memory, and then just tune a parameter over 50 steps. So I tuned this on an older Colab model and saw a 0.001 loss improvement. Afterwards, I added the much more radical changes to batch size and did not retune the momentum cooldown.
This submission reflects all recent WR changes up to PR #134.
The main contribution of this PR is to introduce the concept of a variable batch size per parameter group, implemented via different gradient accumulation strategies for different param groups. I decrease the batch size by 1/3 and then let gradients accumulate in the Adam params over 2 steps before updating, whereas the Muon params update every step.
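As a rough sketch of the idea (not the code from this PR), the per-group accumulation amounts to stepping the Muon optimizer every iteration while letting gradients on the Adam-managed params accumulate and stepping Adam only every other iteration. The `train_step` signature, the `ADAM_EVERY` constant, and the gradient averaging are illustrative assumptions; only the split of parameters between the two optimizers mirrors the description above.

```python
ADAM_EVERY = 2  # Adam params accumulate gradients over 2 steps before updating

def train_step(step, model, muon_opt, adam_opt, inputs, targets):
    # muon_opt and adam_opt are assumed to be torch.optim.Optimizer instances
    # that partition the model's parameters between them.
    loss = model(inputs, targets)
    loss.backward()  # gradients accumulate into .grad on all params

    # Muon-managed params update on every step with the (smaller) per-step batch.
    muon_opt.step()
    muon_opt.zero_grad(set_to_none=True)

    # Adam-managed params keep their accumulated gradients and only update
    # every ADAM_EVERY steps, i.e. with an effectively larger batch size.
    if (step + 1) % ADAM_EVERY == 0:
        for group in adam_opt.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.grad.div_(ADAM_EVERY)  # average rather than sum the window
        adam_opt.step()
        adam_opt.zero_grad(set_to_none=True)

    return loss
```

This gives the Adam params a larger effective batch than the Muon params without increasing per-step work.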
Also, for a minor improvement, I add a cooldown to the momentum term in Muon for the last 50 steps.
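The schedule itself isn't shown in this excerpt; one minimal sketch consistent with the motivation above is a linear decay of Muon's momentum over the final 50 steps (the function name, base momentum value, and linear shape are assumptions, not the PR's actual values):

```python
def muon_momentum(step, total_steps, base_momentum=0.95, cooldown_steps=50):
    """Illustrative momentum cooldown: hold the base momentum until the last
    `cooldown_steps` steps, then decay it linearly so the final small-lr updates
    are driven mostly by fresh gradients rather than old ones."""
    steps_left = total_steps - step
    if steps_left >= cooldown_steps:
        return base_momentum
    return base_momentum * max(steps_left, 0) / cooldown_steps
```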
The mean loss of 3.2765 is well below the 3.28 cutoff. There is likely still hyperparameter tuning to be done, even something as simple as decreasing the step count: I tried only one batch size and one cooldown fraction, and only minimally explored parameter learning rates and momentum terms. There is also a consistent hiccup of about 300ms around step 2385 that may be fixable. Overall, I encourage further hyperparameter tuning, as that was left out of scope for this PR. This investigation into batch size was partially motivated by conversations with @varunneal.
Timing and Validation
Retiming of prior record: 147.4s (runs: 147.451, 147.336, 147.508)