Conversation

@Gusarich
Contributor

This PR builds on all recent improvements, up to #132

Removed the .float() cast before the loss so training keeps logits in BF16 all the way into F.cross_entropy. Validation still casts logits to FP32 to prevent BF16 rounding noise in the reported CE and to keep results comparable with prior runs.

Commit with train_gpt.py change: 346f4cb
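Roughly, the change looks like this (a minimal sketch, not the actual train_gpt.py diff; tensor shapes and names below are made up for illustration):

import torch
import torch.nn.functional as F

# Illustrative sketch only: the training path feeds BF16 logits straight into
# F.cross_entropy, while the validation path still upcasts to FP32 so the
# reported CE is unaffected by BF16 rounding.
vocab_size, tokens = 50304, 16
logits = torch.randn(tokens, vocab_size, dtype=torch.bfloat16)
targets = torch.randint(vocab_size, (tokens,))

train_loss = F.cross_entropy(logits, targets)        # BF16 logits, no .float() before the loss
val_loss = F.cross_entropy(logits.float(), targets)  # FP32 cast kept for the reported validation CE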

Validation for #132:

import scipy.stats
import torch

losses = [3.2786, 3.2769, 3.2774, 3.2762, 3.2799, 3.2789, 3.2812, 3.2779, 3.2811, 3.2785, 3.2764, 3.2777, 3.2770, 3.2789, 3.2782, 3.2804, 3.2783, 3.2793]

times = [149.199, 148.971, 148.927, 148.953, 149.200, 149.090, 149.339, 149.017, 149.129, 149.014, 149.072, 149.125, 149.171, 149.070, 149.395, 149.124, 149.116, 149.224]

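# One-sample t-test against the 3.28 target: one-sided alternative is mean loss < 3.28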
print("p=%.4f" % scipy.stats.ttest_1samp(losses, 3.28, alternative="less").pvalue)
# p=0.0002
print("losses:", torch.std_mean(torch.tensor(losses)))
# losses: (tensor(0.0015), tensor(3.2785))
print("time:", torch.std_mean(torch.tensor(times)))
# time: (tensor(0.1248), tensor(149.1187))

Validation for this WR:

import scipy.stats
import torch

losses = [3.2805, 3.2773, 3.2782, 3.2782, 3.2785, 3.2752, 3.2803, 3.2789, 3.2796, 3.2805, 3.2774, 3.2828, 3.2791, 3.2800, 3.2774, 3.2799, 3.2787, 3.2789, 3.2774, 3.2780, 3.2786, 3.2783, 3.2790, 3.2823]

times = [148.289, 148.497, 148.698, 148.593, 148.573, 148.621, 148.400, 148.536, 148.560, 148.507, 148.262, 148.262, 148.156, 148.144, 148.220, 148.190, 148.201, 148.296, 148.273, 148.189, 148.277, 148.196, 148.119, 148.311]

print("p=%.4f" % scipy.stats.ttest_1samp(losses, 3.28, alternative="less").pvalue)
# p=0.0025
print("losses:", torch.std_mean(torch.tensor(losses)))
# losses: (tensor(0.0016), tensor(3.2790))
print("time:", torch.std_mean(torch.tensor(times)))
# time: (tensor(0.1761), tensor(148.3487))

@Gusarich
Contributor Author

Gusarich commented Sep 27, 2025

I'd love to remove the very last entry with the 3.2823 loss, as it skews both the p-value and the mean loss, but that doesn't feel fair considering I've already opened the PR. Doing more runs would be a better idea, but I've already terminated the instance :(

Result without the very last entry:

p=0.0005
losses: (tensor(0.0015), tensor(3.2788))
time: (tensor(0.1799), tensor(148.3504))
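The result above corresponds to rerunning the WR validation snippet with the final run dropped from both lists, e.g. (sketch reusing the losses and times variables defined in that snippet):

print("p=%.4f" % scipy.stats.ttest_1samp(losses[:-1], 3.28, alternative="less").pvalue)
print("losses:", torch.std_mean(torch.tensor(losses[:-1])))
print("time:", torch.std_mean(torch.tensor(times[:-1])))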

@ClassicLarry
Collaborator

Awesome, great improvement! Looks like the p-value is still below 0.01, so no issues. I see you are using PyTorch 2.10.0.dev20250926+cu126. Either your GPUs happen to run slightly faster or there is a small boost from the PyTorch version as well.

I think I will remove iteration_extension in a future PR once I have a meaningful improvement to compensate, because even though it seems to improve the mean loss, it also seems to add more variance across runs, which makes testing other updates more challenging.

@Gusarich
Contributor Author

> I see you are using PyTorch 2.10.0.dev20250926+cu126. Either your GPUs happen to run slightly faster or there is a small boost from the PyTorch version as well.

I got lucky with GPUs on PrimeIntellect! I started multiple instances and selected the fastest based on the test run.

@ClassicLarry
Collaborator

This PR does not include the prior record txt or readme files. The pull request log is starting to get rather long, and is probably very hard to follow for people seeing the repo for the first time. @KellerJordan, is there any plan to perform merges or add maintainers? Do you have any thoughts on spinning up an actively maintained, community-supported branch?

@Gusarich
Contributor Author

> This PR does not include the prior record txt or readme files.

It won't be a problem once the previous PRs are merged.

@ClassicLarry
Collaborator

> I see you are using PyTorch 2.10.0.dev20250926+cu126. Either your GPUs happen to run slightly faster or there is a small boost from the PyTorch version as well.

> I got lucky with GPUs on PrimeIntellect! I started multiple instances and selected the fastest based on the test run.

Good to know. The next record may have a higher runtime and will just need to benchmark that it's a faster time relative to this one when controlling for GPUs. 148.3 is crazy.

Gusarich added a commit to Gusarich/modded-nanogpt that referenced this pull request Sep 27, 2025
@varunneal
Contributor

I'm having some issues building Flash Attention on this version of Torch Nightly. Did you encounter any issues?

@Gusarich
Contributor Author

> I'm having some issues building Flash Attention on this version of Torch Nightly. Did you encounter any issues?

Try following the instructions from #118.

And yes, I had a few additional problems; sorry I forgot to mention that.

ClassicLarry merged commit fab427a into KellerJordan:master on Oct 15, 2025