
Conversation

@shengfukevin
Contributor

Summary

Fix support for replaying all2all.

Test Plan

Constructed a 4-rank case that invokes torch.distributed.all_to_all() and torch.distributed.all_to_all_single(), then dumped the trace and replayed it.
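For context, here is a minimal sketch of the kind of 4-rank test described above. This is not the exact script used for this PR; the backend choice ("nccl"), tensor shapes, and rendezvous settings are assumptions.

```python
# Hedged sketch of a 4-rank job exercising both collectives mentioned in the test plan.
# Backend, shapes, and MASTER_ADDR/MASTER_PORT values are illustrative assumptions.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    # all_to_all: a list of per-peer input tensors, a list of per-peer output tensors
    inputs = [torch.full((2,), float(rank * world_size + i), device=device) for i in range(world_size)]
    outputs = [torch.empty(2, device=device) for _ in range(world_size)]
    dist.all_to_all(outputs, inputs)

    # all_to_all_single: one flat input tensor split evenly across peers
    flat_in = torch.cat(inputs)
    flat_out = torch.empty_like(flat_in)
    dist.all_to_all_single(flat_out, flat_in)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(4,), nprocs=4)
```

Running a job like this under the trace dumper produces the trace that is then replayed to exercise the all2all path.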

@facebook-github-bot added the CLA Signed label on May 20, 2025.
@facebook-github-bot
Contributor

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
Fix support for replaying all2all.

Test Plan: Constructed a 4-rank case that invokes torch.distributed.all_to_all() and torch.distributed.all_to_all_single(), then dumped the trace and replayed it.

Differential Revision: D75101007

Pulled By: shengfukevin
@facebook-github-bot
Contributor

@shengfukevin has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D75101007

Contributor Author

@shengfukevin left a comment

LGTM! I have one inline question, please check.

    opTensor = []
    if allocate:
-       alloc_func = (
+       i_alloc_func = (
Contributor Author

@sanshang-nv, what is the reason for the change in this code block?

Contributor

Which point?

  1. i_ should be the prefix for input_.
  2. i_scale_factor has the same usage as in the other prepare functions; it was not used before this PR.
  3. The input and output of all_to_all are lists of tensors.

Contributor Author

What I am asking is why you changed the logic that creates data for the input/output tensors.

The original code creates the tensor with initVal when check_data is true, and otherwise fills the tensor with all ones. With your change, data is created differently for input and output tensors, and in some cases you use scaleFactor. What is the reason behind this change?

Contributor

Previous logic:
If dcheck is true, use alloc_ones with initVal to create both the input and output tensors. The problem is that the output tensor should always be random for the data check.
If dcheck is false, use alloc_random, but still with initVal. That is wrong; it should use scale_factor. Otherwise, why pass the scaleFactor parameter in at all?

Fixed logic in this PR:
If dcheck is true, use alloc_ones with initVal.
If dcheck is false, use alloc_random with scaleFactor.
The output tensor should always be allocated with alloc_random.

Take _prep_all_gather as a reference.
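To make that rule concrete, here is a small standalone sketch. The function name and parameters are illustrative; they simply mirror the alloc_ones/alloc_random, initVal, scaleFactor, and dcheck names used above and are not the PR's actual code.

```python
import torch

def alloc_collective_tensor(num_elems: int, device: torch.device, dtype: torch.dtype,
                            dcheck: bool, init_val: float, scale_factor: float,
                            is_output: bool) -> torch.Tensor:
    """Illustrative sketch of the fixed allocation rule discussed above (not the PR code).

    Output tensors are always random: the collective overwrites them, and a
    pre-filled known value would make the data check trivially pass.
    """
    if is_output or not dcheck:
        # alloc_random equivalent: random data scaled by scale_factor
        return torch.rand(num_elems, device=device, dtype=dtype) * scale_factor
    # alloc_ones equivalent: a known value so the received data can be verified
    return torch.ones(num_elems, device=device, dtype=dtype) * init_val
```

Under this rule, input tensors would be allocated with is_output=False, and the buffers passed as collective outputs with is_output=True.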

@shengfukevin

Contributor Author

Thanks. I think the value used to initialize the output tensor should not matter, right? Since it will be overwritten anyway.

Contributor Author

I need a stamp from the Meta side to get this approved.

@facebook-github-bot
Contributor

@shengfukevin merged this pull request in 1fea15e.


Labels

CLA Signed, fb-exported, Merged
