NCCL timeout: rank 0 hangs on buffer sync in pre-forward #20981
Unanswered
addisonklinke
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Hi all, when using the basic `Trainer(devices=8)` setup, I can get training to complete fine. However, if I lower the batch size below a critical limit (roughly one third of the normal setting), I start getting consistent NCCL timeouts. So far, the most direct evidence I've found is the traceback below. It appears `_sync_buffers()` is hanging on rank 0, causing ranks 1+ to hit the default NCCL timeout at 30 minutes. I'm curious whether anyone has seen similar behavior before, particularly correlated with changes in batch size? I'll be digging into the source code more to understand the point of the buffer sync, but would appreciate any pointers. Thanks!