NCCL timeout: rank 0 hangs on buffer sync in pre-forward #20981
Unanswered
addisonklinke
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Hi all, when using the basic `Trainer(devices=8)` setup, I can get training to complete fine. However, if I lower the batch size below a critical limit (roughly one third of the normal setting), I start getting consistent NCCL timeouts. So far, the most direct evidence I've found is the traceback below. It appears `_sync_buffers()` is hanging on rank 0, causing ranks 1+ to hit the default NCCL timeout at 30 minutes. I'm curious whether anyone has seen similar behavior before, particularly correlated with changes in batch size? I'll be digging into the source code more to understand the point of the buffer sync, but would appreciate any pointers. Thanks!