barrierless rendezvous for faster init #535
Conversation
yesss! one question though: can we combine this rendezvous with the retry logic in tcpr?
@suchenzang do we have examples of usage of tcpr in our main?
simplification by @lw Co-authored-by: Luca Wehrstedt <[email protected]>
i don't understand how this works. can you explain?
if STORE_BASED_BARRIER_MARKER in key:
    return self.world_size
I see: specifically, if it's the initialization barrier key, we just pretend the barrier is already done.
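A minimal sketch (not the exact PR code) of how the quoted check can short-circuit the store-based barrier: distributed_c10d's `_store_based_barrier()` calls `store.add()` on a key containing the barrier marker and spins until the counter reaches world_size, so answering with world_size right away makes that loop exit immediately. The wrapper class, its constructor, and the marker value below are assumptions.

```python
import torch.distributed as dist

# Assumed value; mirrors the "store_based_barrier_key" prefix used in distributed_c10d.py.
STORE_BASED_BARRIER_MARKER = "store_based_barrier_key"


class BarrierlessStore(dist.PrefixStore):
    """Hypothetical store wrapper that makes the store-based barrier a no-op."""

    def __init__(self, prefix, store, world_size):
        super().__init__(prefix, store)
        self.world_size = world_size

    def add(self, key, value):
        if STORE_BASED_BARRIER_MARKER in key:
            # Pretend every rank has already checked in.
            return self.world_size
        return super().add(key, value)
```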
I noticed this part of the docs on TCPStore:
https://pytorch.org/docs/stable/distributed.html#torch.distributed.TCPStore
Namely, on construction it has a wait_for_workers=True parameter by default, which we are not currently setting. Maybe we can get the same effect by just turning this off?
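A hedged example of the flag being discussed (host/port/world_size values are placeholders): constructing the TCPStore with wait_for_workers=False skips the constructor-time wait where the server blocks until world_size workers have connected.

```python
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore(
    host_name="127.0.0.1",
    port=29500,
    world_size=4,
    is_master=True,                  # only rank 0 hosts the store daemon
    timeout=timedelta(seconds=300),
    wait_for_workers=False,          # do not block until all workers have connected
)
```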
I see, there are actually two barriers in PyTorch.
One in C++, for the TCP store:
https://github.com/pytorch/pytorch/blob/c18da597e0bb1c1aecc97c77a73fed1849057fa4/torch/csrc/distributed/c10d/TCPStore.cpp#L1012-L1050
And one in the process group initialization:
https://github.com/pytorch/pytorch/blob/master/torch/distributed/distributed_c10d.py#L851-L863
And then there is a third one in our own code, the global_barrier() call in metaseq/metaseq/distributed/utils.py (line 224 in 2e882aa).
So maybe we can combine both TCPStore(wait_for_workers=False) and this PR and get even better overlap?
@stephenroller, what do you want to combine with TCPStore, again?
As far as I understand, you just want to add wait_for_workers=False in _create_c10d_store (metaseq/metaseq/distributed/rendezvous.py, line 25 in 2e882aa):
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
It looks to me like that should work. @lw what do you think?
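A sketch of the change being discussed, hypothetical since the real _create_c10d_store in metaseq/distributed/rendezvous.py may have a different signature and body: thread wait_for_workers=False through to the TCPStore constructor so the constructor-time barrier is skipped as well.

```python
from datetime import timedelta
from torch.distributed import TCPStore

def _create_c10d_store(hostname, port, world_size, start_daemon,
                       timeout=timedelta(seconds=300)):
    # start_daemon maps onto TCPStore's is_master argument.
    return TCPStore(
        hostname, port, world_size, start_daemon, timeout,
        multi_tenant=True,
        wait_for_workers=False,  # proposed addition under discussion
    )
```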
I think @stephenroller meant to do both: this PR and passing the wait_for_workers=False flag to the TCPStore constructor, since they control two different barriers (both of which could be a problem).
I believe the wait_for_workers=False flag will have limited effect, because it only controls one barrier, which happens in the constructor of the store, and the store is only constructed once! (The sub-PGs re-use the same store as the root PG; they don't create a new one.) I do not know why that barrier was needed or what safety implications come with removing it, so I'd suggest looking into that.
I'd personally recommend removing each of these barriers in a separate commit, to test their performance and correctness impact separately.
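An illustration of the point above, assuming the usual setup where one root process group is created and sub-groups are derived from it: the TCPStore (and its wait_for_workers wait) is built only once, while the Python-level store-based barrier runs again for every new_group() call, so that is the barrier this PR actually targets.

```python
import torch.distributed as dist

# TCPStore is constructed once here; wait_for_workers only affects this step.
dist.init_process_group(backend="nccl", init_method="env://")

# Sub-groups reuse the same store; only the store-based barrier runs again.
ranks = list(range(dist.get_world_size()))
sub_group = dist.new_group(ranks=ranks)
```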
+1 to separating the proposed wait_for_workers=False change into an entirely separate PR and testing it independently.
* added P561184389
* removed comments
* fixed torch
* barrierlesstcpr
* Update metaseq/distributed/rendezvous.py: simplification by @lw (Co-authored-by: Luca Wehrstedt <[email protected]>)
* barrierlesstcpr
* black
Co-authored-by: Luca Wehrstedt <[email protected]>
Patch Description
Added a barrierlessenv init_method for PyTorch's distributed init_process_group, for faster initialization of the process group.
Testing steps
Run with --distributed_init_method barrierlessenv://
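A minimal sketch, not the actual metaseq implementation, of what a "barrierlessenv" init_method can look like: a custom rendezvous handler registered with torch.distributed that reads the usual env:// variables and yields a TCPStore created without the constructor-time barrier, so that init_process_group(init_method="barrierlessenv://") picks it up. The handler name and env-variable handling below are assumptions.

```python
import os
from datetime import timedelta

import torch.distributed as dist
from torch.distributed import TCPStore


def _barrierlessenv_rendezvous_handler(url, timeout=timedelta(seconds=300), **kwargs):
    # Read the standard env:// rendezvous variables.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    store = TCPStore(
        os.environ["MASTER_ADDR"],
        int(os.environ["MASTER_PORT"]),
        world_size,
        rank == 0,          # rank 0 hosts the store daemon
        timeout,
        wait_for_workers=False,
    )
    yield (store, rank, world_size)


dist.register_rendezvous_handler("barrierlessenv", _barrierlessenv_rendezvous_handler)
```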