This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Conversation

Contributor

@igormolybogFB igormolybogFB commented Nov 29, 2022

Patch Description
Added a barrierlessenv init_method for pytorch's distributed init_process_group. This allows faster initialization of the process group.

Testing steps
Run with --distributed_init_method barrierlessenv://
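For context, PyTorch dispatches init_method URLs to rendezvous handlers by URL scheme, and new schemes can be registered with torch.distributed.register_rendezvous_handler. A stdlib-only sketch of that dispatch idea, assuming torch may not be available here (the registry, `rendezvous`, and `barrierless_env_handler` names are illustrative stand-ins, not the actual metaseq/PyTorch code):

```python
import os
from urllib.parse import urlparse

# Hypothetical stand-in for PyTorch's rendezvous-handler registry.
# (torch.distributed.register_rendezvous_handler works the same way:
# each URL scheme maps to a generator yielding (store, rank, world_size).)
_handlers = {}

def register_rendezvous_handler(scheme, handler):
    if scheme in _handlers:
        raise RuntimeError(f"rendezvous scheme {scheme!r} is already registered")
    _handlers[scheme] = handler

def rendezvous(url, **kwargs):
    # Dispatch on the URL scheme, e.g. "barrierlessenv://" -> "barrierlessenv".
    return _handlers[urlparse(url).scheme](url, **kwargs)

# Toy "barrierlessenv" handler: read rank/world_size from the environment
# (as the built-in env:// handler does) and hand back a store immediately,
# without waiting for the other workers to check in.
def barrierless_env_handler(url, **kwargs):
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    store = {}  # placeholder for a real TCPStore / FileStore
    yield store, rank, world_size

register_rendezvous_handler("barrierlessenv", barrierless_env_handler)
```

With a handler like this registered, passing --distributed_init_method barrierlessenv:// routes init_process_group through it instead of the default env:// handler.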

Contributor

@suchenzang suchenzang left a comment


yesss! one question though: can we combine this rendezvous with the retry logic in tcpr?

@igormolybogFB
Contributor Author

@suchenzang do we have examples of usage of tcpr in our main?

igormolybogFB and others added 2 commits December 1, 2022 10:06
@stephenroller
Contributor

i don't understand how this works. can you explain?

Comment on lines +90 to +91
if STORE_BASED_BARRIER_MARKER in key:
return self.world_size
Contributor

I see, specifically if it's the initialization, we just pretend it's done.
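The interception can be modeled with a small store wrapper: init_process_group's store-based barrier has every rank polling store.add(...) on a key containing the barrier marker until the counter reaches world_size, so answering world_size immediately makes each rank believe the barrier already completed. A dict-backed sketch under that assumption (`BarrierlessStore` is a hypothetical name; the real patch wraps the actual c10d store):

```python
# Marker appearing in torch.distributed's store-based barrier keys
# (assumed to match the constant in distributed_c10d at the time of this PR).
STORE_BASED_BARRIER_MARKER = "store_based_barrier_key"

class BarrierlessStore:
    """Dict-backed sketch of the barrier interception.

    The store-based barrier has every rank call store.add(barrier_key, 1)
    once, then poll store.add(barrier_key, 0) until the counter reaches
    world_size. Returning world_size for any barrier key makes each rank
    believe all peers already arrived, so the barrier exits immediately.
    """

    def __init__(self, world_size):
        self.world_size = world_size
        self._counters = {}

    def add(self, key, amount):
        if STORE_BASED_BARRIER_MARKER in key:
            return self.world_size  # pretend the barrier already completed
        # Normal counter semantics for every non-barrier key.
        self._counters[key] = self._counters.get(key, 0) + amount
        return self._counters[key]
```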

I noticed this part of the docs on TCPStore:

https://pytorch.org/docs/stable/distributed.html#torch.distributed.TCPStore

Namely, on construction it has a wait_for_workers=True default parameter, which we are not currently setting. Maybe we can get the same effect by just turning this off?
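To illustrate what that flag controls: with the wait-for-workers behaviour on, the hosting rank's store constructor blocks until all world_size workers have connected; with it off, construction returns immediately. A toy stdlib model of those semantics (`make_network`, `start_server`, and `connect_worker` are hypothetical names for illustration, not the TCPStore API):

```python
import threading

def make_network():
    """Shared state standing in for the workers' TCP connections to the store."""
    return {"joined": 0, "cv": threading.Condition()}

def start_server(net, world_size, wait_for_workers=True, timeout=5.0):
    """Toy model of constructing the store on the hosting rank.

    wait_for_workers=True: block until world_size workers have connected
    (or raise on timeout). wait_for_workers=False: return immediately and
    let workers trickle in afterwards.
    """
    if wait_for_workers:
        with net["cv"]:
            if not net["cv"].wait_for(lambda: net["joined"] >= world_size, timeout):
                raise TimeoutError("not all workers connected in time")
    return net

def connect_worker(net):
    """Toy model of a worker rank connecting to the store."""
    with net["cv"]:
        net["joined"] += 1
        net["cv"].notify_all()
```

With the flag off, the server side comes up instantly, which is the kind of overlap this thread is after.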

Contributor

@stephenroller stephenroller Dec 1, 2022

I see, there are actually two barriers in pytorch

One in C++, for TCP store:
https://github.com/pytorch/pytorch/blob/c18da597e0bb1c1aecc97c77a73fed1849057fa4/torch/csrc/distributed/c10d/TCPStore.cpp#L1012-L1050

And one in the process group initialization:
https://github.com/pytorch/pytorch/blob/master/torch/distributed/distributed_c10d.py#L851-L863

And then a third in our own code:

global_barrier()

So maybe we can combine both TCPStore(wait_for_workers=False) and this and get even better overlap?

Contributor Author

@stephenroller what do you want to combine with TCPStore, once again?

As far as I understand, you want to just add wait_for_workers=False in _create_c10d_store (which currently passes hostname, port, world_size, start_daemon, timeout, multi_tenant=True) and continue using tcpr?

It looks to me like that should work. @lw what do you think?

Contributor

I think @stephenroller meant to do both this PR and pass the wait_for_workers=False flag to the TCPStore constructor, since they control two different barriers (both of which could be a problem).

I believe the wait_for_workers=False flag will have limited effect, because it only controls one barrier which happens in the constructor of the store, and the store is only constructed once! (The sub-PGs re-use the same store as the root PG, they don't create a new one). I do not know why that barrier was needed and what safety implications come with removing it, I'd suggest looking into that.

I'd personally recommend removing each of these barriers in a separate commit, to test their performance and correctness impact separately.

Contributor

@stephenroller stephenroller Dec 2, 2022

+1 to separating the proposed wait_for_workers=False change into an entirely separate PR and testing it independently.

@suchenzang suchenzang merged commit 0f26f9d into main Jan 19, 2023
@suchenzang suchenzang deleted the igor/barrierless branch January 19, 2023 17:24
suchenzang pushed a commit that referenced this pull request Jan 19, 2023
* added P561184389

* removed comments

* fixed torch

* barrierlesstcpr

* Update metaseq/distributed/rendezvous.py

simplification by @lw

Co-authored-by: Luca Wehrstedt <[email protected]>

* barrierlesstcpr

* black

Co-authored-by: Luca Wehrstedt <[email protected]>