Conversation

@Gusarich
Contributor

This PR builds on all recent improvements, up to #132

Removed the .float() cast before the loss so training keeps logits in BF16 all the way into F.cross_entropy. Validation still casts logits to FP32 to prevent BF16 rounding noise in the reported CE and to keep results comparable with prior runs.

Commit with train_gpt.py change: 346f4cb
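Roughly, the change looks like this (a minimal sketch, not the actual train_gpt.py diff; tensor shapes and names below are made up for illustration):

import torch
import torch.nn.functional as F

# Illustrative sketch only: the training path feeds BF16 logits straight into
# F.cross_entropy, while the validation path still upcasts to FP32 so the
# reported CE is unaffected by BF16 rounding.
vocab_size, tokens = 50304, 16
logits = torch.randn(tokens, vocab_size, dtype=torch.bfloat16)
targets = torch.randint(vocab_size, (tokens,))

train_loss = F.cross_entropy(logits, targets)        # BF16 logits, no .float() before the loss
val_loss = F.cross_entropy(logits.float(), targets)  # FP32 cast kept for the reported validation CE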

Validation for #132:

import scipy.stats
import torch

losses = [3.2786, 3.2769, 3.2774, 3.2762, 3.2799, 3.2789, 3.2812, 3.2779, 3.2811, 3.2785, 3.2764, 3.2777, 3.2770, 3.2789, 3.2782, 3.2804, 3.2783, 3.2793]

times = [149.199, 148.971, 148.927, 148.953, 149.200, 149.090, 149.339, 149.017, 149.129, 149.014, 149.072, 149.125, 149.171, 149.070, 149.395, 149.124, 149.116, 149.224]

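# One-sample t-test against the 3.28 target: one-sided alternative is mean loss < 3.28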
print("p=%.4f" % scipy.stats.ttest_1samp(losses, 3.28, alternative="less").pvalue)
# p=0.0002
print("losses:", torch.std_mean(torch.tensor(losses)))
# losses: (tensor(0.0015), tensor(3.2785))
print("time:", torch.std_mean(torch.tensor(times)))
# time: (tensor(0.1248), tensor(149.1187))

Validation for this WR:

import scipy.stats
import torch

losses = [3.2805, 3.2773, 3.2782, 3.2782, 3.2785, 3.2752, 3.2803, 3.2789, 3.2796, 3.2805, 3.2774, 3.2828, 3.2791, 3.2800, 3.2774, 3.2799, 3.2787, 3.2789, 3.2774, 3.2780, 3.2786, 3.2783, 3.2790, 3.2823]

times = [148.289, 148.497, 148.698, 148.593, 148.573, 148.621, 148.400, 148.536, 148.560, 148.507, 148.262, 148.262, 148.156, 148.144, 148.220, 148.190, 148.201, 148.296, 148.273, 148.189, 148.277, 148.196, 148.119, 148.311]

print("p=%.4f" % scipy.stats.ttest_1samp(losses, 3.28, alternative="less").pvalue)
# p=0.0025
print("losses:", torch.std_mean(torch.tensor(losses)))
# losses: (tensor(0.0016), tensor(3.2790))
print("time:", torch.std_mean(torch.tensor(times)))
# time: (tensor(0.1761), tensor(148.3487))

@Gusarich
Contributor Author

Gusarich commented Sep 27, 2025

I'd love to remove the very last entry with the 3.2823 loss, as it skews both the p-value and the mean loss, but that doesn't feel fair considering I've already opened the PR. Doing more runs would be a better idea, but I've already terminated the instance :(

Result without the very last entry:

p=0.0005
losses: (tensor(0.0015), tensor(3.2788))
time: (tensor(0.1799), tensor(148.3504))
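The result above corresponds to rerunning the WR validation snippet with the final run dropped from both lists, e.g. (sketch reusing the losses and times variables defined in that snippet):

print("p=%.4f" % scipy.stats.ttest_1samp(losses[:-1], 3.28, alternative="less").pvalue)
print("losses:", torch.std_mean(torch.tensor(losses[:-1])))
print("time:", torch.std_mean(torch.tensor(times[:-1])))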

@ClassicLarry
Collaborator

Awesome, great improvement! Looks like the p-value is still below 0.01, so no issues. I see you are using PyTorch 2.10.0.dev20250926+cu126. Either your GPUs happen to run slightly faster or there is a small boost from the PyTorch version as well.

I think I will remove iteration_extension in a future PR once I have a meaningful improvement to compensate, because even though it seems to improve the mean loss, it also seems to add more variance across runs, which makes testing other updates more challenging.

@Gusarich
Contributor Author

> I see you are using PyTorch 2.10.0.dev20250926+cu126. Either your GPUs happen to run slightly faster or there is a small boost from the PyTorch version as well.

I got lucky with GPUs on PrimeIntellect! I started multiple instances and selected the fastest based on the test run.

@ClassicLarry
Collaborator

This PR does not include the prior record txt or readme files. The pull request log is starting to get rather long, and is probably very hard to follow for people seeing the repo for the first time. @KellerJordan, is there any plan to perform merges or add maintainers? Do you have any thoughts on spinning up an actively maintained, community-supported branch?

@Gusarich
Contributor Author

> This PR does not include the prior record txt or readme files.

It won't be a problem once the previous PRs are merged.

@ClassicLarry
Collaborator

> I see you are using PyTorch 2.10.0.dev20250926+cu126. Either your GPUs happen to run slightly faster or there is a small boost from the PyTorch version as well.

> I got lucky with GPUs on PrimeIntellect! I started multiple instances and selected the fastest based on the test run.

Good to know. The next record may have a higher runtime and will just need to benchmark that it's a faster time relative to this one when controlling for GPUs. 148.3 is crazy.

Gusarich added a commit to Gusarich/modded-nanogpt that referenced this pull request Sep 27, 2025
@varunneal
Contributor

I'm having some issues building Flash Attention on this version of Torch Nightly. Did you encounter any issues?

@Gusarich
Contributor Author

> I'm having some issues building Flash Attention on this version of Torch Nightly. Did you encounter any issues?

Try following the instructions from #118.

And yes, I had a few additional problems; sorry I forgot to mention that.

ClassicLarry merged commit fab427a into KellerJordan:master on Oct 15, 2025