
Conversation

@shengfukevin
Contributor

Summary

Fix support for replaying all2all.

Test Plan

Constructed a 4-rank case that invokes torch.distributed.all_to_all() and torch.distributed.all_to_all_single(), then dumped the trace and replayed it.
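For context, here is a minimal sketch of the kind of 4-rank test described above. This is not the exact script used for this PR; the backend choice ("nccl"), tensor shapes, and rendezvous settings are assumptions.

```python
# Hedged sketch of a 4-rank job exercising both collectives mentioned in the test plan.
# Backend, shapes, and MASTER_ADDR/MASTER_PORT values are illustrative assumptions.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    # all_to_all: a list of per-peer input tensors, a list of per-peer output tensors
    inputs = [torch.full((2,), float(rank * world_size + i), device=device) for i in range(world_size)]
    outputs = [torch.empty(2, device=device) for _ in range(world_size)]
    dist.all_to_all(outputs, inputs)

    # all_to_all_single: one flat input tensor split evenly across peers
    flat_in = torch.cat(inputs)
    flat_out = torch.empty_like(flat_in)
    dist.all_to_all_single(flat_out, flat_in)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(4,), nprocs=4)
```

Running a job like this under the trace dumper produces the trace that is then replayed to exercise the all2all path.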

@facebook-github-bot added the CLA Signed label on May 20, 2025.
@facebook-github-bot
Contributor

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
Fix support for replaying all2all.

Test Plan: Constructed a 4-rank case that invokes torch.distributed.all_to_all() and torch.distributed.all_to_all_single(), then dumped the trace and replayed it.

Differential Revision: D75101007

Pulled By: shengfukevin
@facebook-github-bot
Contributor

@shengfukevin has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D75101007

Contributor Author

@shengfukevin left a comment

LGTM! I have one inline question, please check.

    opTensor = []
    if allocate:
-       alloc_func = (
+       i_alloc_func = (
Contributor Author

@sanshang-nv, what is the reason for the change in this code block?

Contributor

Which point?

  1. i_ should be the prefix for input_.
  2. i_scale_factor has the same usage as in the other prepare functions; it was not used before this PR.
  3. The input and output of all_to_all are lists of tensors.

Contributor Author

What I am asking is why you changed the logic that creates data for the input/output tensors.

The original code creates the tensor with initVal when check_data is true, and otherwise fills the tensor with all ones. With your change, data is created differently for input and output tensors, and in some cases you use scaleFactor. What is the reason behind this change?

Contributor

Previous logic:
If dcheck is true, use alloc_ones with initVal to create both the input and output tensors. The problem is that the output tensor should always be random for the data check.
If dcheck is false, use alloc_random, but still with initVal. That is wrong; it should use scale_factor. Otherwise, why pass the scaleFactor parameter in at all?

Fixed logic in this PR:
If dcheck is true, use alloc_ones with initVal.
If dcheck is false, use alloc_random with scaleFactor.
The output tensor should always be allocated with alloc_random.

Take _prep_all_gather as a reference.
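To make that rule concrete, here is a small standalone sketch. The function name and parameters are illustrative; they simply mirror the alloc_ones/alloc_random, initVal, scaleFactor, and dcheck names used above and are not the PR's actual code.

```python
import torch

def alloc_collective_tensor(num_elems: int, device: torch.device, dtype: torch.dtype,
                            dcheck: bool, init_val: float, scale_factor: float,
                            is_output: bool) -> torch.Tensor:
    """Illustrative sketch of the fixed allocation rule discussed above (not the PR code).

    Output tensors are always random: the collective overwrites them, and a
    pre-filled known value would make the data check trivially pass.
    """
    if is_output or not dcheck:
        # alloc_random equivalent: random data scaled by scale_factor
        return torch.rand(num_elems, device=device, dtype=dtype) * scale_factor
    # alloc_ones equivalent: a known value so the received data can be verified
    return torch.ones(num_elems, device=device, dtype=dtype) * init_val
```

Under this rule, input tensors would be allocated with is_output=False, and the buffers passed as collective outputs with is_output=True.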

@shengfukevin

Contributor Author

Thanks. I think the value used to initialize the output tensor should not matter, right? Since it will be overwritten anyway.

Contributor Author

I need a stamp from the Meta side to get this approved.

@facebook-github-bot
Contributor

@shengfukevin merged this pull request in 1fea15e.


Labels

CLA Signed, fb-exported, Merged
