
Conversation

@ClassicLarry
Collaborator

This submission reflects all recent WR changes up to PR#134.

The main contribution of this PR is to introduce variable batch size by parameter group, implemented via different gradient accumulation strategies for different param groups. I drop the batch size by 1/3, then let gradients accumulate in the Adam params over 2 steps before updating, while the Muon params update every step.
Also, as a minor improvement, I add a cooldown to Muon's momentum term over the last 50 steps (see the sketch after the list below).

  • num_iterations: 1630->2380
  • cooldown_frac: 0.5->0.4
  • adam beta1: 0.8->0.7
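For illustration only, here is a minimal sketch of one way such a momentum cooldown could look, assuming a linear decay over the final 50 steps; the 0.95 base momentum, the linear shape, and the helper name muon_momentum are assumptions, not the exact PR implementation.

# Illustrative Muon momentum cooldown (assumed linear decay over the last 50 steps;
# the 0.95 base value and the linear shape are placeholders, not the exact PR schedule).
def muon_momentum(step, num_iterations, base_momentum=0.95, cooldown_steps=50):
    steps_left = num_iterations - step
    if steps_left >= cooldown_steps:
        return base_momentum
    return base_momentum * steps_left / cooldown_steps

# Applied each iteration before the Muon optimizer steps, e.g.:
# for group in optimizer2.param_groups:
#     group["momentum"] = muon_momentum(step, num_iterations)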

The mean loss of 3.2765 is well below the 3.28 cutoff. There is likely further hyperparameter tuning to be done, even something as simple as decreasing the step count: I only tried one batch size and one cooldown frac, and only minimally looked at param learning rates or momentum terms. There is also a consistent ~300ms hiccup around step 2385 that may be fixable. Overall I encourage further hyperparameter tuning, as that was left out of scope for this PR. This investigation into batch size was partially motivated by conversations with @varunneal.

# Only step Adam every other step; Muon (optimizer2) steps every iteration.
if step % 2 == 0:
    # even step: update the Muon params only and clear their gradients,
    # letting the Adam params keep accumulating gradients
    optimizer2.step()
    optimizer2.zero_grad(set_to_none=True)
else:
    # odd step: update every optimizer on the accumulated gradients
    for opt in optimizers:
        opt.step()
    # null all the gradients
    model.zero_grad(set_to_none=True)
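Because the Adam gradients are only cleared after the odd steps, the Adam param groups accumulate over two forward/backward passes and effectively see twice the batch size (and half the update frequency) of the Muon param groups.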

Timing and Validation

import scipy.stats
import torch

losses = [3.2745, 3.2747, 3.2771, 3.2794, 3.2767]
times = [146.933, 146.893, 146.943, 146.553, 146.739]

print("p=%.4f" % scipy.stats.ttest_1samp(losses, 3.28, alternative="less").pvalue)
# p=0.0086

print("losses:", torch.std_mean(torch.tensor(losses)))
# losses: (tensor(0.0020), tensor(3.2765))

print("time:", torch.std_mean(torch.tensor(times)))
# time: (tensor(0.1664), tensor(146.8122))

Retiming of the prior record: 147.4s mean over [147.451, 147.336, 147.508]

@Gusarich
Contributor

The mean loss of 3.2765 is well below the 3.28 cutoff.

You could easily decrease steps by 15-20 and cut off a whole second while still being under 3.28 loss!

@ClassicLarry
Collaborator Author

The mean loss of 3.2765 is well below the 3.28 cutoff.

You could easily decrease steps by 15-20 and cut off a whole second while still being under 3.28 loss!

I am done testing for a while, but yes the next couple seconds should fall pretty quickly.

Gusarich added a commit to Gusarich/modded-nanogpt that referenced this pull request Sep 30, 2025
@ClassicLarry ClassicLarry merged commit ebf2dfc into KellerJordan:master Oct 15, 2025
@varunneal
Contributor

@ClassicLarry I believe this record also introduced the cooldown phase for Muon. What was the motivation for it in this PR?

@ClassicLarry
Collaborator Author

ClassicLarry commented Nov 15, 2025

@ClassicLarry I believe this record also introduced the cooldown phase for Muon. What was the motivation for it in this PR?

I don't know how accurate this is, but I have a mental image of the model during training moving down the side of a snowy mountain, with the final low-LR regime being the model committing to a local crevasse. When taking really small steps in that crevasse, I don't want to be influenced by old gradients; I want the agility to move with the angle of the crevasse. That was the theoretical motivation.

The second factor is that end-of-training changes are really easy to tune: you can save a couple of model checkpoints from right near the end of a run, load them into memory, and then just tune a parameter over the final 50 steps.

So I tuned this on an older Colab model and saw a 0.001 loss improvement. Afterwards, I added the much more radical batch size changes and did not retune the momentum cooldown.
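As a rough illustration of that workflow, here is a minimal sketch of reloading a late-training checkpoint and sweeping a single knob over the final steps; the checkpoint path, the run_final_steps callback, and the candidate values are hypothetical and not part of the PR.

# Hypothetical sketch of the end-of-training tuning loop described above.
# run_final_steps is a user-supplied callback that rebuilds the model/optimizers
# from the checkpoint state, finishes the last ~50 steps with the candidate value,
# and returns the validation loss. Nothing here is taken from the actual PR code.
import copy
import torch

def sweep_final_steps(checkpoint_path, candidates, run_final_steps):
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    results = {}
    for value in candidates:
        state = copy.deepcopy(checkpoint)  # fresh copy so candidates do not interfere
        results[value] = run_final_steps(state, value)
    return results

# e.g. sweep_final_steps("checkpoint_2330.pt", [0, 25, 50, 100], run_final_steps)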
