barrierless rendezvous for faster init #535
Conversation
yesss! one question though: can we combine this rendezvous with the retry logic in tcpr?
@suchenzang do we have examples of usage of tcpr in our main?
simplification by @lw Co-authored-by: Luca Wehrstedt <[email protected]>
i don't understand how this works. can you explain?
if STORE_BASED_BARRIER_MARKER in key:
    return self.world_size
I see: specifically, if it's the initialization barrier key, we just pretend the barrier is already done.
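A minimal sketch (not the exact PR code) of how the quoted check can short-circuit the store-based barrier: distributed_c10d's `_store_based_barrier()` calls `store.add()` on a key containing the barrier marker and spins until the counter reaches world_size, so answering with world_size right away makes that loop exit immediately. The wrapper class, its constructor, and the marker value below are assumptions.

```python
import torch.distributed as dist

# Assumed value; mirrors the "store_based_barrier_key" prefix used in distributed_c10d.py.
STORE_BASED_BARRIER_MARKER = "store_based_barrier_key"


class BarrierlessStore(dist.PrefixStore):
    """Hypothetical store wrapper that makes the store-based barrier a no-op."""

    def __init__(self, prefix, store, world_size):
        super().__init__(prefix, store)
        self.world_size = world_size

    def add(self, key, value):
        if STORE_BASED_BARRIER_MARKER in key:
            # Pretend every rank has already checked in.
            return self.world_size
        return super().add(key, value)
```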
I noticed this part of the docs on TCPStore:
https://pytorch.org/docs/stable/distributed.html#torch.distributed.TCPStore
Namely, on construction it has a wait_for_workers=True parameter by default, which we are not currently setting. Maybe we can get the same effect by just turning this off?
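A hedged example of the flag being discussed (host/port/world_size values are placeholders): constructing the TCPStore with wait_for_workers=False skips the constructor-time wait where the server blocks until world_size workers have connected.

```python
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore(
    host_name="127.0.0.1",
    port=29500,
    world_size=4,
    is_master=True,                  # only rank 0 hosts the store daemon
    timeout=timedelta(seconds=300),
    wait_for_workers=False,          # do not block until all workers have connected
)
```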
I see, there are actually two barriers in PyTorch.
One in C++, for the TCP store:
https://github.com/pytorch/pytorch/blob/c18da597e0bb1c1aecc97c77a73fed1849057fa4/torch/csrc/distributed/c10d/TCPStore.cpp#L1012-L1050
And one in the process group initialization:
https://github.com/pytorch/pytorch/blob/master/torch/distributed/distributed_c10d.py#L851-L863
And then there is a third one in our own code, the global_barrier() call in metaseq/metaseq/distributed/utils.py (line 224 in 2e882aa).
So maybe we can combine both TCPStore(wait_for_workers=False) and this PR and get even better overlap?
@stephenroller, what do you want to combine with TCPStore, again?
As far as I understand, you just want to add wait_for_workers=False in _create_c10d_store (metaseq/metaseq/distributed/rendezvous.py, line 25 in 2e882aa):
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
It looks to me like that should work. @lw what do you think?
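A sketch of the change being discussed, hypothetical since the real _create_c10d_store in metaseq/distributed/rendezvous.py may have a different signature and body: thread wait_for_workers=False through to the TCPStore constructor so the constructor-time barrier is skipped as well.

```python
from datetime import timedelta
from torch.distributed import TCPStore

def _create_c10d_store(hostname, port, world_size, start_daemon,
                       timeout=timedelta(seconds=300)):
    # start_daemon maps onto TCPStore's is_master argument.
    return TCPStore(
        hostname, port, world_size, start_daemon, timeout,
        multi_tenant=True,
        wait_for_workers=False,  # proposed addition under discussion
    )
```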
I think @stephenroller meant to do both: this PR and passing the wait_for_workers=False flag to the TCPStore constructor, since they control two different barriers (both of which could be a problem).
I believe the wait_for_workers=False flag will have limited effect, because it only controls one barrier, which happens in the constructor of the store, and the store is only constructed once! (The sub-PGs re-use the same store as the root PG; they don't create a new one.) I do not know why that barrier was needed or what safety implications come with removing it, so I'd suggest looking into that.
I'd personally recommend removing each of these barriers in a separate commit, to test their performance and correctness impact separately.
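An illustration of the point above, assuming the usual setup where one root process group is created and sub-groups are derived from it: the TCPStore (and its wait_for_workers wait) is built only once, while the Python-level store-based barrier runs again for every new_group() call, so that is the barrier this PR actually targets.

```python
import torch.distributed as dist

# TCPStore is constructed once here; wait_for_workers only affects this step.
dist.init_process_group(backend="nccl", init_method="env://")

# Sub-groups reuse the same store; only the store-based barrier runs again.
ranks = list(range(dist.get_world_size()))
sub_group = dist.new_group(ranks=ranks)
```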
+1 to separating the proposed wait_for_workers=False change into an entirely separate PR and testing it independently.
* added P561184389
* removed comments
* fixed torch
* barrierlesstcpr
* Update metaseq/distributed/rendezvous.py: simplification by @lw (Co-authored-by: Luca Wehrstedt <[email protected]>)
* barrierlesstcpr
* black
Co-authored-by: Luca Wehrstedt <[email protected]>
Patch Description
Added a barrierlessenv init_method for PyTorch's distributed init_process_group, for faster initialization of the process group.
Testing steps
Run with --distributed_init_method barrierlessenv://
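A minimal sketch, not the actual metaseq implementation, of what a "barrierlessenv" init_method can look like: a custom rendezvous handler registered with torch.distributed that reads the usual env:// variables and yields a TCPStore created without the constructor-time barrier, so that init_process_group(init_method="barrierlessenv://") picks it up. The handler name and env-variable handling below are assumptions.

```python
import os
from datetime import timedelta

import torch.distributed as dist
from torch.distributed import TCPStore


def _barrierlessenv_rendezvous_handler(url, timeout=timedelta(seconds=300), **kwargs):
    # Read the standard env:// rendezvous variables.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    store = TCPStore(
        os.environ["MASTER_ADDR"],
        int(os.environ["MASTER_PORT"]),
        world_size,
        rank == 0,          # rank 0 hosts the store daemon
        timeout,
        wait_for_workers=False,
    )
    yield (store, rank, world_size)


dist.register_rendezvous_handler("barrierlessenv", _barrierlessenv_rendezvous_handler)
```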