New WR: Adam Gradient Sync in Backward Hooks #149
Conversation
Refactor gradient synchronization to use futures for reduce scatter.
```python
@torch.compile
@torch.no_grad()
def _sync_gradient(self, param):
```
I ran my tests with this torch.compile decorator in place but realized afterwards that backward hooks are not supported, so I think this doesn't have any effect
This looks good. There are 2 PRs open before this that will get merged first and may create a dependency on torch nightly 0926, so this will be retimed with those. I would expect the same impact but cannot guarantee it. Since this does not modify the ML, it does not require a p-value check. Independent of this PR:
Thanks for taking a look and for the reply! I originally tested this on a different nightly version and saw similar results before doing my "official" testing, so hopefully it still holds. I would love to write up an article covering GPU profiling 101 on NanoGPT. I am actually working on a series of articles covering simplified implementations of distributed training algos and using the PyTorch profiler to analyze and improve them. I only have one post currently, on DDP (https://blog.underfit.ai/ddp-profiled), but tbh I want to rewrite and improve it. I'm wrapping up a post on ZeRO-1 and then have a draft on ZeRO-2 next. If this record holds I was planning to write a similar post about the experience, the many mistakes I made along the way, and the other things I tried changing which did not work. The learning and exploring I did to write those posts is actually what got me here.
Hi @ClassicLarry, I wrote an intro-to-profiling post: https://blog.underfit.ai/profiling-101-nanogpt. If you or anyone else has any feedback, or if you think there's anything more I should add, feel free to send me a message on X @underfitai (or reply here, though I'm not sure if this PR is the right place for that conversation).
Great, hoping to get this one merged next week.
```python
reduce_scatter_futures: list[torch.Future] = []
world_size = dist.get_world_size()
all_gather_futures: list[torch.Future] = []
grad_slices = []
```
I am getting an error on line 715 because it's expecting grad_slices to exist: g_slice = grad_slices[idx]. It may be simplest if, along with updating this param, you resync this PR on top of the now-merged train.py.
I think the PR is likely missing a commit, as it doesn't quite match the logs.
Looking at it further, the issue is that the previously merged PR #146 modified some of the structure of DistAdam(). This PR will need to be updated to account for that new structure.
Sorry! I made a mistake while resolving merge conflicts after #146 was merged. g_slice is defined on line 709, so line 715 just needed to be removed.
I did my test runs before #146 was merged, so the DistAdam changes were not reflected in the logs. I've re-run everything with the latest changes and updated the log files.
Merged at 136.122, getting a 0.9s speedup over the last record.
Hello and happy Halloween! I think I've set a new record.
Adam Gradient Sync in Backward Hooks
This PR improves the overall training time and average training step time by moving the `DistAdam` gradient sync reduce-scatter collectives for each model parameter out of the `step` method and into a backward hook. The `step` method is then modified to iterate through the param groups and parameters in reverse order to benefit from this change by stepping parameters in later layers first. The parameters of later layers will have their backward hooks called sooner, which should result in their gradient syncs being triggered earlier and completed sooner.

Timing and Validation
This PR improves the final training time by ~0.7 seconds
This PR:
Previous PR timed on same machine:
Changes
Reduce Scatter in `DistAdam` Backward Hook

Even though the collective operations are async, they all occur at the end of each training step. The previous implementation looped through each parameter in order, launched the reduce-scatter operation, and then immediately waited for it to complete.
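As a rough, hypothetical sketch of that pattern (not the repository's exact code; the function name, buffer shapes, and reduce op are placeholders), launching and immediately waiting serializes the collectives at the start of the optimizer step:

```python
# Hypothetical sketch of the previous pattern: each reduce-scatter is launched
# and then immediately waited on, so the collectives run back-to-back at the
# start of the optimizer step instead of overlapping with other work.
import torch
import torch.distributed as dist

def step_old(params: list[torch.nn.Parameter], world_size: int):
    for p in params:
        # each rank receives a 1/world_size slice of the averaged gradient
        # (assumes p.numel() is divisible by world_size)
        g_slice = torch.empty(p.numel() // world_size,
                              dtype=p.grad.dtype, device=p.grad.device)
        work = dist.reduce_scatter_tensor(g_slice, p.grad.flatten(),
                                          op=dist.ReduceOp.AVG, async_op=True)
        work.wait()  # blocks right away, one collective at a time
        # ... Adam update on g_slice, then all-gather the updated slice ...
```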
In this PR I moved the reduce-scatter operation launch out of the `step` method and into a backward hook, registered using `register_post_accumulate_grad_hook` to ensure the gradients are ready. Since the backward hooks will be executed first for the later model layers, their reduce-scatters will start and complete first.
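A minimal sketch of the hook-based launch, assuming each rank keeps a flat 1/world_size gradient slice per parameter. The names reduce_scatter_futures, grad_slices, and _sync_gradient mirror names visible in the diff above, but the bodies here are illustrative rather than the exact train.py code:

```python
# Illustrative sketch: register a post-accumulate-grad hook per parameter so
# its reduce-scatter is launched as soon as that gradient is ready, during the
# backward pass, rather than at the start of step().
import torch
import torch.distributed as dist

reduce_scatter_futures: list[torch.Future] = []
grad_slices: list[torch.Tensor] = []

def register_gradient_sync_hooks(params: list[torch.nn.Parameter]):
    world_size = dist.get_world_size()

    def _sync_gradient(param: torch.Tensor):
        # this rank's 1/world_size slice of the averaged gradient
        # (assumes param.numel() is divisible by world_size)
        g_slice = torch.empty(param.numel() // world_size,
                              dtype=param.grad.dtype, device=param.grad.device)
        grad_slices.append(g_slice)
        work = dist.reduce_scatter_tensor(g_slice, param.grad.flatten(),
                                          op=dist.ReduceOp.AVG, async_op=True)
        reduce_scatter_futures.append(work.get_future())

    for p in params:
        # fires once the gradient for p has finished accumulating
        p.register_post_accumulate_grad_hook(_sync_gradient)
```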
Step params and param groups in reverse order

To take advantage of this, I modified the `step` method of `DistAdam` to iterate through the param_groups and parameters in reverse order. In our init function we define param groups that group parameters by tensor shape. The parameters of later layers are at the end of the parameter lists, and since their reduce-scatters are launched earlier, we iterate over and wait on their reduce-scatter futures earlier in the `step` method's loop. Similarly, we iterate through param_groups in reverse because the first group corresponds to the first shape we encounter, so parameters for later layers should be contained in later param_groups.
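Continuing the sketch above (same module-level reduce_scatter_futures and grad_slices; the Adam math and the all-gather of updated slices are elided), the reverse-order consumption could look roughly like this:

```python
# Illustrative sketch: walk param_groups and their params in reverse so the
# futures are consumed in roughly the order the hooks launched them
# (later layers first).
import torch

@torch.no_grad()
def step_sketch(param_groups):
    idx = 0
    for group in reversed(param_groups):
        for param in reversed(group["params"]):
            # assumes hooks fired in reverse parameter order, so index idx
            # lines up with this param's future and gradient slice
            reduce_scatter_futures[idx].wait()
            g_slice = grad_slices[idx]
            # ... Adam update on this rank's g_slice, then an async all-gather
            #     writes the updated slice back into the full parameter ...
            idx += 1
    reduce_scatter_futures.clear()
    grad_slices.clear()
```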
Profiler Trace Analysis
The trace files are checked in and can be explored using the Perfetto trace viewer.
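For anyone who wants to capture a similar trace, here is a generic sketch using the PyTorch profiler (this is not necessarily how the checked-in trace files were produced; train_step is a placeholder for one training iteration):

```python
# Generic sketch: profile a few training steps and export a trace that the
# Perfetto UI (or chrome://tracing) can open.
from torch.profiler import profile, ProfilerActivity

def capture_trace(train_step, num_steps: int = 3, path: str = "trace.json"):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(num_steps):
            train_step()
    prof.export_chrome_trace(path)
```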
Current Implementation
To start, here is the profiler trace for the current implementation. We can see that the first reduce-scatter operation begins at the start of the `DistAdam` step.

First Reduce-Scatter
Overview
Overlap
Looking at the GPU streams, we can see that the initial reduce-scatter does not overlap with the main GPU stream.
Hook Implementation
Similarly, in the new implementation we can see that the first reduce-scatter is launched by the first hook.
First Reduce-Scatter
Overview
Overlap
This time we can see that the reduce-scatter overlaps with the computation on the main GPU stream.
