
Conversation

@ClassicLarry (Collaborator) commented Oct 4, 2025

This submission reflects all recent WR changes up to PR#136.

  1. Implements PR#139 by @snimu, with some minor tuning for the short track. In the standard transformer architecture, contributions to the residual stream have to serve two purposes at once: provide context to downstream layers and add to the final prediction. Information can be valuable for downstream context without being directly useful for the final prediction. A lambda is added so that context added to the residual stream in the first 8 layers can be backed out before the final prediction (see the code excerpt below).
  2. Hyperparam tuning, pulling the number of steps down following the last PR (a sketch of how cooldown_frac enters the LR schedule follows this list):
    • num_iterations: 2380->2290
    • cooldown_frac: 0.4->0.45
    • adam beta1: 0.7->0.65
  3. Cleanup of the extra lambda params, dropping the count from 72 to 64. This fixes the hiccups/stuttering and saves 1.8s.
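
For reference, a minimal sketch of how the tuned cooldown_frac plays into the speedrun's trapezoidal LR schedule. The exact decay shape and floor live in train_gpt.py; the linear decay to zero below is an assumption, not the repo's actual get_lr.

def get_lr_mult(step: int, num_iterations: int = 2290, cooldown_frac: float = 0.45) -> float:
    # Hedged sketch: hold the peak LR for the first (1 - cooldown_frac) of training,
    # then decay linearly over the final cooldown_frac of the run.
    frac_done = step / num_iterations
    if frac_done < 1 - cooldown_frac:
        return 1.0
    return max((1 - frac_done) / cooldown_frac, 0.0)  # assumed linear decay toward 0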

Layer 8 was chosen after implementing a per-layer lambda version and observing these coefficients (a hedged sketch of that diagnostic follows):
[0.5400, 0.4613, 0.4364, 0.3429, 0.2675, 0.3030, 0.2023, 0.3761, -0.0741, -0.2164, -0.2905]
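
A hedged sketch of what such a per-layer diagnostic could look like; the names and structure are illustrative, not the PR's actual code. Each block's contribution to the residual stream gets its own learned coefficient, which is backed out before the head, and the coefficients above are the values those lambdas settled at.

import torch

def forward_with_per_layer_backout(x: torch.Tensor, blocks, lambdas: torch.Tensor) -> torch.Tensor:
    # Illustrative only: record each block's addition to the residual stream,
    # then remove lambda_i * contribution_i before the final prediction.
    contributions = []
    for block in blocks:
        new_x = block(x)                 # assumed signature; the real blocks take more arguments
        contributions.append(new_x - x)  # this block's contribution to the residual stream
        x = new_x
    for lam, delta in zip(lambdas, contributions):
        x = x - lam * delta              # lam > 0: mattered more for context than for prediction
    return x

The first eight coefficients are clearly positive while the last three go negative, which motivated collapsing the per-layer scheme into a single backout at layer 8, as in the snippet below.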

    # inside the block loop: snapshot the residual stream at layer 8
    if i == 8:
        x_backout = x

# after the loop: back out contributions from the first 8 layers that are only
# required for downstream context, not for the direct prediction
x -= backout_lambda * x_backout
x = norm(x)
logits = self.lm_head(x)

Dropping the extra torch.zeros(num_layers) brings the scalar count to 64 instead of 72: a clean 8 params per GPU instead of 9.

pad = (-num_layers * 5 - 2) % dist.get_world_size()  # pad so the flat tensor splits evenly across GPUs
self.scalars = nn.Parameter(
    torch.cat(
        [
            -1.5 * torch.ones(num_layers),  # skip_weights -> σ(-1.5) ≈ 0.18
            *[torch.tensor([1.0, 0.0]) for _ in range(num_layers)],  # block lambdas
            *[torch.tensor([0.5, 0.5]) for _ in range(num_layers)],  # SA lambdas
            torch.zeros(1),       # smear_lambda
            0.5 * torch.ones(1),  # backout_lambda
            torch.ones(pad),      # padding
        ]
    )
)
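
For orientation, a hedged sketch of how this flat parameter could be sliced back into its named pieces; the offsets simply mirror the concatenation order above, and the real indexing in train_gpt.py may differ.

# Hypothetical unpacking, mirroring the concatenation order above.
skip_weights   = self.scalars[:num_layers]                                   # one per layer
block_lambdas  = self.scalars[num_layers:3 * num_layers].view(num_layers, 2)
sa_lambdas     = self.scalars[3 * num_layers:5 * num_layers].view(num_layers, 2)
smear_lambda   = self.scalars[5 * num_layers]
backout_lambda = self.scalars[5 * num_layers + 1]
# the trailing torch.ones(pad) entries exist only so the tensor splits evenly across GPUs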

If I try to drop this down to 56 by removing the 6 extra skips and 2 padding lambdas, the runtime goes up slightly. It appears that a param size of 64 performs better under Adam than 56 or 72. Since Adam splits the param 8 ways across GPUs, that means Adam does better with an array of size 8 per GPU than with 7 or 9.
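
The arithmetic behind that observation, assuming the distributed Adam setup chunks this flat parameter evenly across the 8 GPUs:

world_size = 8
for total in (56, 64, 72):
    print(f"{total} scalars -> {total // world_size} per GPU")
# 56 scalars -> 7 per GPU
# 64 scalars -> 8 per GPU
# 72 scalars -> 9 per GPU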

Timing and Validation

import scipy.stats
import torch

losses = [3.2772, 3.2796, 3.2781, 3.2783, 3.2769]
times = [140.626, 140.678, 140.693, 140.718, 140.769]

print("p=%.4f" % scipy.stats.ttest_1samp(losses, 3.28, alternative="less").pvalue)
# p=0.0070

print("losses:", torch.std_mean(torch.tensor(losses)))
# losses: (tensor(0.0011), tensor(3.2780))

print("time:", torch.std_mean(torch.tensor(times)))
# time: (tensor(0.0525), tensor(140.6968))

Retiming of the prior record: 146.9s mean over [147.189, 146.906, 146.690].
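
Not part of the submission's own validation, but as a sanity check the new times can be compared against the prior-record retiming with a Welch t-test (tiny samples, so take the p-value as indicative only):

import scipy.stats

new_times = [140.626, 140.678, 140.693, 140.718, 140.769]
prior_times = [147.189, 146.906, 146.690]

res = scipy.stats.ttest_ind(new_times, prior_times, equal_var=False, alternative="less")
print("p=%.2e" % res.pvalue)  # new runs are faster than the retimed prior record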

@Gusarich (Contributor) commented Oct 4, 2025

incredible!

Gusarich added a commit to Gusarich/modded-nanogpt that referenced this pull request Oct 8, 2025
@xTimeCrystal commented

ReLU > sigmoid attention gating

@snimu (Contributor) commented Oct 9, 2025

@xTimeCrystal do you have any experimental results to show this? If yes, you could just make a PR
