
Conversation

@ClassicLarry
Collaborator

This submission reflects all recent WR changes up to PR#134.

The main contribution of this PR is to introduce variable batch size by parameter group, implemented via different gradient accumulation strategies for different param groups. I drop the batch size by 1/3, then let gradients accumulate in the Adam params over 2 steps before updating, while the Muon params update every step.
Also, as a minor improvement, I add a cooldown to Muon's momentum term over the last 50 steps (see the sketch after the list below).

  • num_iterations: 1630->2380
  • cooldown_frac: 0.5->0.4
  • adam beta1: 0.8->0.7
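For illustration only, here is a minimal sketch of one way such a momentum cooldown could look, assuming a linear decay over the final 50 steps; the 0.95 base momentum, the linear shape, and the helper name muon_momentum are assumptions, not the exact PR implementation.

# Illustrative Muon momentum cooldown (assumed linear decay over the last 50 steps;
# the 0.95 base value and the linear shape are placeholders, not the exact PR schedule).
def muon_momentum(step, num_iterations, base_momentum=0.95, cooldown_steps=50):
    steps_left = num_iterations - step
    if steps_left >= cooldown_steps:
        return base_momentum
    return base_momentum * steps_left / cooldown_steps

# Applied each iteration before the Muon optimizer steps, e.g.:
# for group in optimizer2.param_groups:
#     group["momentum"] = muon_momentum(step, num_iterations)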

The mean loss of 3.2765 is well below the 3.28 cutoff. There is likely further hyperparameter tuning to be done, even something as simple as decreasing the step count: I only tried one batch size and one cooldown frac, and only minimally looked at param learning rates or momentum terms. There is also a consistent ~300ms hiccup around step 2385 that may be fixable. Overall I encourage further hyperparameter tuning, as that was left out of scope for this PR. This investigation into batch size was partially motivated by conversations with @varunneal.

# Only step Adam every other step; Muon (optimizer2) steps every iteration.
if step % 2 == 0:
    # even step: update the Muon params only and clear their gradients,
    # letting the Adam params keep accumulating gradients
    optimizer2.step()
    optimizer2.zero_grad(set_to_none=True)
else:
    # odd step: update every optimizer on the accumulated gradients
    for opt in optimizers:
        opt.step()
    # null all the gradients
    model.zero_grad(set_to_none=True)
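Because the Adam gradients are only cleared after the odd steps, the Adam param groups accumulate over two forward/backward passes and effectively see twice the batch size (and half the update frequency) of the Muon param groups.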

Timing and Validation

import scipy.stats
import torch

losses = [3.2745, 3.2747, 3.2771, 3.2794, 3.2767]
times = [146.933, 146.893, 146.943, 146.553, 146.739]

print("p=%.4f" % scipy.stats.ttest_1samp(losses, 3.28, alternative="less").pvalue)
# p=0.0086

print("losses:", torch.std_mean(torch.tensor(losses)))
# losses: (tensor(0.0020), tensor(3.2765))

print("time:", torch.std_mean(torch.tensor(times)))
# time: (tensor(0.1664), tensor(146.8122))

Retiming of the prior record: 147.4s mean over [147.451, 147.336, 147.508]

@Gusarich
Contributor

The mean loss of 3.2765 is well below the 3.28 cutoff.

You could easily decrease steps by 15-20 and cut off a whole second while still being under 3.28 loss!

@ClassicLarry
Collaborator Author

The mean loss of 3.2765 is well below the 3.28 cutoff.

You could easily decrease steps by 15-20 and cut off a whole second while still being under 3.28 loss!

I am done testing for a while, but yes the next couple seconds should fall pretty quickly.

Gusarich added a commit to Gusarich/modded-nanogpt that referenced this pull request Sep 30, 2025
@ClassicLarry ClassicLarry merged commit ebf2dfc into KellerJordan:master Oct 15, 2025
@varunneal
Contributor

@ClassicLarry I believe this record also introduced the cooldown phase for Muon. What was the motivation for it in this PR?

@ClassicLarry
Collaborator Author

ClassicLarry commented Nov 15, 2025

@ClassicLarry I believe this record also introduced the cooldown phase for Muon. What was the motivation for it in this PR?

I don't know how accurate this is, but I have a mental image of the model during training moving down the side of a snowy mountain, with the final low-LR regime being the model committing to a local crevasse. When taking really small steps in that crevasse, I don't want to be influenced by old gradients; I want the agility to move with the angle of the crevasse. That was the theoretical motivation.

The second factor is that end-of-training changes are really easy to tune: you can save a couple of model checkpoints from right near the end of a run, load them into memory, and then just tune a parameter over the final 50 steps.

So I tuned this on an older Colab model and saw a 0.001 loss improvement. Afterwards, I added the much more radical batch size changes and did not retune the momentum cooldown.
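As a rough illustration of that workflow, here is a minimal sketch of reloading a late-training checkpoint and sweeping a single knob over the final steps; the checkpoint path, the run_final_steps callback, and the candidate values are hypothetical and not part of the PR.

# Hypothetical sketch of the end-of-training tuning loop described above.
# run_final_steps is a user-supplied callback that rebuilds the model/optimizers
# from the checkpoint state, finishes the last ~50 steps with the candidate value,
# and returns the validation loss. Nothing here is taken from the actual PR code.
import copy
import torch

def sweep_final_steps(checkpoint_path, candidates, run_final_steps):
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    results = {}
    for value in candidates:
        state = copy.deepcopy(checkpoint)  # fresh copy so candidates do not interfere
        results[value] = run_final_steps(state, value)
    return results

# e.g. sweep_final_steps("checkpoint_2330.pt", [0, 25, 50, 100], run_final_steps)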
