
Conversation

@shenberg
Contributor

I followed the recommendation @varunneal made in his cautious weight decay work and implemented it for Adam. There's no run-time cost, since Adam's run time is dominated by cross-GPU communication. This let me drop 20 steps; there's probably room for more.

I disabled weight decay on the scalars, since it seems to me that for some of them the 'natural' value is not 0, and it would make more sense to do cautious weight decay relative to that 'natural' value. I've left this as future work for now.
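For context, here is a minimal sketch of what cautious weight decay looks like inside an Adam-style step, using the mask discussed in this thread. Names like `p`, `grad`, `exp_avg`, `lr`, and `wd` are illustrative and not the repo's actual code; the exact scaling of the decay term follows the repo's convention and may include an extra factor of `lr`, as discussed further down.

```python
import torch

# Minimal sketch of cautious weight decay (CWD) inside a decoupled-WD,
# Adam-style step. Illustrative names only; not the repo's actual optimizer.
@torch.no_grad()
def adam_step_cwd(p, grad, exp_avg, exp_avg_sq, step, lr, wd,
                  betas=(0.9, 0.95), eps=1e-8):
    beta1, beta2 = betas
    exp_avg.lerp_(grad, 1 - beta1)                                # first moment
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    update = (exp_avg / bias1) / ((exp_avg_sq / bias2).sqrt() + eps)

    # Cautious weight decay: only decay entries where the Adam update is
    # already pushing the weight toward zero, i.e. sign(update) == sign(p).
    mask = (update * p) >= 0
    p.sub_(lr * update)
    p.sub_(lr * wd * p * mask)   # decoupled decay, applied only under the mask
```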


A small note: the first run had an exceptionally large loss (3.287) which skewed the mean; discounting that run, we would have a much lower average. I noticed that baseline runs had relatively high but valid losses: [3.2793, 3.2778, 3.2802, 3.279].

@linux-leo

Maybe the natural value could be calculated using some sort of moving average...

@shenberg
Contributor Author

I decided to make this attempt because I was confused by the code in Muon that multiplies the weight-decay factor by the LR twice (I thought it was a bug). I discovered why it's done when reading the readme by @varunneal (it acts as a schedule for the WD factor), but not before finding that I could get roughly the same score without this scheduling (I set the factor for NorMuon a bit higher, at a constant 0.042, and got pretty much the same results). I'm unconvinced that WD scheduling is necessary at the moment, but I haven't run the tuning needed to get the correct values without it, so I'll leave this for sometime in the future.
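For context, the pattern in question looks roughly like this. This is a sketch with illustrative names and values, not the repo's exact Muon code.

```python
import torch

# Sketch of the "WD factor multiplied by the LR twice" pattern discussed above.
p = torch.randn(64, 64)
wd = 0.005          # nominal weight-decay factor
lr = 0.05           # current learning rate from the LR schedule

# Effective decay per step is lr**2 * wd. Because lr follows the LR schedule,
# the effective decay is implicitly scheduled too, shrinking toward 0 as lr
# decays at the end of training.
eff_wd = lr * wd
p.sub_(lr * eff_wd * p)
```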

@varunneal
Contributor

There's some theoretical motivation for multiplying by the LR twice: https://arxiv.org/abs/2512.08217

But ultimately, imo, the schedule's important just for decaying WD to 0 by the end of training. I found it effective on Muon. I'm not sure it's as useful on Adam because of the nature of the embeddings.

@varunneal
Contributor

varunneal commented Dec 18, 2025

> The first run had an exceptionally large loss which skewed the mean

3.287 is very strange, but of course possible as the tail end of a normal distribution. Hopefully it is not being caused by CWD.

@ClassicLarry
Collaborator

> The first run had an exceptionally large loss which skewed the mean
>
> 3.287 is very strange, but of course possible as the tail end of a normal distribution. Hopefully it is not being caused by CWD.

Looks like the runs are higher variance in general. 3.272 is absurdly low and would never have happened before. Better understanding what is causing this could unlock a couple of seconds.

@varunneal
Contributor

@ClassicLarry yeah, the std dev of 0.0036 turns out to be about 3 times higher than the present record's, or most records'. I'm looking into whether this is caused by the fp8 scales being improperly calibrated for CWD.

@ClassicLarry
Collaborator

ClassicLarry commented Dec 19, 2025

> @ClassicLarry yeah, the std dev of 0.0036 turns out to be about 3 times higher than the present record's, or most records'. I'm looking into whether this is caused by the fp8 scales being improperly calibrated for CWD.

Good idea. The embed and lm_head are quite different: 75x different learning rates, and each embedding only activates about 1/50,000 of the time. I doubt these should have the same decay rate. I haven't validated the runtime yet, but I'm confident the record will hold and this is worth merging in. But I expect I will then prefer testing new changes after temporarily removing anything that introduces high variance, like this change.

I wonder how CWD is impacting sparse embeddings. If an embedding does not show up in a batch, it will have a gradient of zero, and `mask = (update * p_slice) >= 0` will drive that embedding to zero. We might be better off changing `>=` to `>` so that sparse embeddings maintain their size.
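A minimal, self-contained sketch of that edge case (illustrative names; not the repo's actual optimizer code):

```python
import torch

# Sparse-embedding edge case: rows absent from the batch get a zero update.
p_slice = torch.randn(4, 8)          # a slice of the embedding table
update = torch.zeros_like(p_slice)   # zero update for rows not in the batch
update[0] = torch.randn(8)           # pretend only row 0 appeared in the batch

# With '>=', rows with a zero update satisfy (0 * p) >= 0 everywhere,
# so they are decayed every step and drift toward zero.
mask_ge = (update * p_slice) >= 0

# With '>', rows with a zero update are excluded from decay and keep their norm.
mask_gt = (update * p_slice) > 0

print(mask_ge[1].all().item())   # True:  untouched row would still be decayed
print(mask_gt[1].any().item())   # False: untouched row is left alone
```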

On the topic of the scalars, these might be better off with different learning rates. Some end up around 25 and some stay between 0 and 1. When I was testing updates to the x0 lambda, I found the loss changed by a decent amount depending on the configuration.

I am curious how CWD is affecting the lm_head vectors of the 300 tokens that never occur during training.

@ClassicLarry
Collaborator

The current decay weight of 0.005 gives:
(.008 * 75)^2 * 0.005 = 0.0018 decay per step for embed
(.008)^2 * 0.005 = 3.2e-7 decay per step for lm_head.

So there may be effectively no impact on the lm_head. Planning to look more into what this is actually doing. If we can replicate the 3.272 scenario, it's another couple of seconds off.
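A quick reproduction of that arithmetic, assuming the effective per-step decay is lr**2 * wd with an lm_head LR of 0.008 and an embedding LR 75x larger, as mentioned above:

```python
# Back-of-the-envelope effective decay per step, assuming decay = lr**2 * wd.
wd = 0.005
lm_head_lr = 0.008
embed_lr = 0.008 * 75            # the ~75x gap mentioned above

print(embed_lr ** 2 * wd)        # ~0.0018  decay per step for embed
print(lm_head_lr ** 2 * wd)      # 3.2e-07  decay per step for lm_head
```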

@ClassicLarry
Collaborator

Confirmed a 1.1s decrease; updating the main readme to 127.7s to maintain the 1.1s gap from the prior record.

@ClassicLarry merged commit 49465cc into KellerJordan:master on Dec 20, 2025
@shenberg
Contributor Author

shenberg commented Dec 24, 2025

> > @ClassicLarry yeah, the std dev of 0.0036 turns out to be about 3 times higher than the present record's, or most records'. I'm looking into whether this is caused by the fp8 scales being improperly calibrated for CWD.
>
> Good idea. The embed and lm_head are quite different: 75x different learning rates, and each embedding only activates about 1/50,000 of the time. I doubt these should have the same decay rate. I haven't validated the runtime yet, but I'm confident the record will hold and this is worth merging in. But I expect I will then prefer testing new changes after temporarily removing anything that introduces high variance, like this change.
>
> I wonder how CWD is impacting sparse embeddings. If an embedding does not show up in a batch, it will have a gradient of zero, and `mask = (update * p_slice) >= 0` will drive that embedding to zero. We might be better off changing `>=` to `>` so that sparse embeddings maintain their size.
>
> On the topic of the scalars, these might be better off with different learning rates. Some end up around 25 and some stay between 0 and 1. When I was testing updates to the x0 lambda, I found the loss changed by a decent amount depending on the configuration.
>
> I am curious how CWD is affecting the lm_head vectors of the 300 tokens that never occur during training.

I read the CWD paper a bit more carefully and tried a few things on Adam that left me thinking that all the value is in dealing with sparse updates correctly, and not in the cautious part at all:

  1. Giving an additional margin such that weight decay will never flip the sign of an element if the update wouldn't flip it itself. Seemed maybe a teeny bit better. I thought maybe I had something here, but...
  2. Reversing the sign of the mask, so `mask = (update * p_slice) < 0`, seemed equally OK...
  3. Assuming that only sparse gradients are the problem, so `mask = (update * p_slice) != 0` - also seems fine.

It seems to me that the main insight of the CWD paper is to ensure that weight decay doesn't flip the sign of the update to any specific parameter, so we're still optimizing the original function and not a surrogate. I think one can be less cautious, i.e. weight-decay more, and still optimize the same objective. Something along the lines of `mask = ((update * p_slice) > 0) | ((update * p_slice) < p_slice.square() * (-eff_weight_decay * lr))` ("decay if the directions agree, or if the weight-decay step has smaller magnitude than the optimizer update"). No spot instances are available at the moment, so I'll see in a bit whether this helps at all. I'm not optimistic, given the non-zero-update experiment. Maybe this line of reasoning makes more sense for Muon, though the more expensive mask may make it not worthwhile there; Adam being comms-constrained makes it ripe for experimentation.
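For reference, here is a sketch of the mask variants above as boolean masks. Names are illustrative; `eff_weight_decay` and `lr` stand in for whatever the optimizer actually uses, and variant 1's extra margin is omitted.

```python
import torch

# The mask variants discussed above, written out as a sketch.
p_slice = torch.randn(16, 32)
update = torch.randn(16, 32)
eff_weight_decay, lr = 0.005, 0.008
prod = update * p_slice

mask_cautious = prod >= 0          # original CWD mask
mask_reversed = prod < 0           # variant 2: sign reversed
mask_nonzero  = prod != 0          # variant 3: only exclude zero (sparse) updates

# "Less cautious" mask from the paragraph above: decay where directions agree,
# or where the decay step is smaller in magnitude than the optimizer update.
mask_less_cautious = (prod > 0) | (prod < p_slice.square() * (-eff_weight_decay * lr))

# In all cases, the decay is then applied only where the mask is True, e.g.
# p_slice.sub_(lr * eff_weight_decay * p_slice * mask_less_cautious)
```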

Update: my less-cautious WD mask seems to be a beneficial change. I'll rebase once the <2min record is merged and work on top of it; it should allow dropping some more steps, but it needs a nicer parametrization, as even in Adam it measurably slows down steps.
