
General Changes for multi accelerators #145521


Open
wants to merge 1 commit into base: main

Conversation


@rahulsingh-intel rahulsingh-intel commented Jan 23, 2025

This PR intends to generalize the framework for multiple accelerators.
Major changes include:

Add TEST_CUDA and TEST_HPU conditions for generalization in a common place.
Replace ".cuda()" calls with ".to(device_type)" (sketched below).

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Jan 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145521

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit 679b12c with merge base 2a9e737 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels on Jan 23, 2025
@cpuhrsch requested review from albanD and wconstab on January 24, 2025
@cpuhrsch added the triaged label on January 24, 2025
@rahulsingh-intel (Contributor Author)

Hi @kwen2501, please review the changes. Thanks.

torch.cuda.nccl.version() < version,
f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
)
return lambda func: func
Contributor:

Can you explain this change? What does return lambda func: func do?

What is the desired behavior here? If TEST_CUDA is not enabled, do you want this macro to skip_but_pass, or do you want the test to run? A bit more description of the intended behavior in the PR description would be helpful.

Contributor Author:

The requires_nccl decorator is only used in CUDA environments, so we skip it in non-CUDA environments via lambda func: func, which returns a no-op decorator when TEST_CUDA is False.
This was changed in the common place instead of modifying every test module where the decorator is called.
Thanks.
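For reference, a minimal sketch of the pattern under discussion, with unittest.skipIf standing in for the repo's skip helper; the TEST_CUDA flag and message format mirror the diff, the rest is illustrative:

import unittest
import torch

TEST_CUDA = torch.cuda.is_available()

def requires_nccl_version(version, msg: str):
    # On non-CUDA platforms, return the test function unchanged (a no-op decorator)
    # instead of querying torch.cuda.nccl, which is only meaningful with CUDA.
    if not TEST_CUDA:
        return lambda func: func
    return unittest.skipIf(
        torch.cuda.nccl.version() < version,
        f"Requires NCCL version greater than or equal to: {version}, "
        f"found: {torch.cuda.nccl.version()}, reason: {msg}",
    )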

Contributor:

Returning a no-op decorator could make other platforms run a test that requires NCCL and cause a failure; it breaks the logic of requires_nccl().


device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Contributor:

replace this with
torch.get_device_module(device_type.type).device_count()

def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
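For context, a short sketch of how the suggested torch.get_device_module call resolves the backend module; the hard-coded "cuda" device is an assumption for the example:

import torch

device_type = torch.device("cuda")  # assumed; could be "xpu", "hpu", etc.

# torch.get_device_module maps a device or device-type string to its backend
# module (torch.cuda, torch.xpu, ...), so device_count() works uniformly.
device_module = torch.get_device_module(device_type.type)
num_devices = device_module.device_count()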

parametrize,
run_tests,
TEST_WITH_DEV_DBG_ASAN,
)


device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Contributor:

Replace this API with

def get_device_module(device: _Optional[_Union[torch.device, str]] = None):

TEST_WITH_DEV_DBG_ASAN,
)


device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Contributor:

replace with

def get_device_module(device: _Optional[_Union[torch.device, str]] = None):



class TestFSDPStateDict(FSDPTest):
@property
def world_size(self):
return min(torch.cuda.device_count(), 2)
return min(_get_device_module(device_type.type).device_count(), 2)
@ankurneog (Contributor) commented Feb 3, 2025:

replace with

def get_device_module(device: _Optional[_Union[torch.device, str]] = None):

i.e. torch.get_device_module(device_type).device_count()

@@ -1272,7 +1277,7 @@ def test_world_size_one(self):
class TestFSDPStateDict4GPUs(FSDPTest):
@property
def world_size(self):
return torch.cuda.device_count()
return max(_get_device_module(device_type.type).device_count(), 2)
Contributor:

replace with torch.get_device_module(device_type.type).device_count()

def get_device_module(device: _Optional[_Union[torch.device, str]] = None):

from torch.testing._internal.distributed._tensor.common_dtensor import (
DTensorTestBase,
skip_if_lt_x_gpu,
with_comms,
)


device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Contributor:

replace with torch.get_device_module(device_type.type).device_count()

@rahulsingh-intel (Contributor Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/145521/head returned non-zero exit code 1

Rebasing (1/1)
Auto-merging test/distributed/fsdp/test_fsdp_grad_acc.py
Auto-merging test/distributed/fsdp/test_fsdp_optim_state.py
CONFLICT (content): Merge conflict in test/distributed/fsdp/test_fsdp_optim_state.py
error: could not apply d99f0ddf0d1... General Changes for multi accelerators
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply d99f0ddf0d1... General Changes for multi accelerators

Raised by https://github.com/pytorch/pytorch/actions/runs/13164618168

@rahulsingh-intel (Contributor Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased common_fsdp onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout common_fsdp && git pull --rebase)

@rahulsingh-intel (Contributor Author)

Hi @kwen2501, please review.

@AnantGulati (Contributor)

@albanD @wconstab @kwen2501 @EikanWang
Could you please help with this review? It has been stuck for some time now.

@AnantGulati (Contributor) commented Feb 20, 2025

Hi @H-Huang @albanD @wconstab @kwen2501 @fegin

Could you please help with this review? All comments have been addressed.

Thanks

@AnantGulati (Contributor)

@wconstab @albanD @awgu @kwen2501 @fegin @H-Huang @c-p-i-o
Could you please help with this PR approval? All comments have been addressed.
Thanks

@EikanWang (Collaborator)

@zhangxiaoli73 , could you help check if your FSDP related PR has covered some of this PR?

@zhangxiaoli73 (Contributor)

@zhangxiaoli73 , could you help check if your FSDP related PR has covered some of this PR?

My PR is leveraging the generalized tests from Anant to enhance XCCL backend support on XPU, so there is no conflict or duplication now.

@AnantGulati (Contributor)

"Try to land this since the failure is unrelated."
@pytorchbot merge


pytorch-bot bot commented Feb 26, 2025

This PR needs to be approved by an authorized maintainer before merge.

@AnantGulati (Contributor)

@albanD @wconstab @kwen2501 @fegin
All comments have been addressed and the CI failure appears to be unrelated. Could you please help with the approval of this PR?

@AnantGulati (Contributor)

@weifengpy Could you please review this PR?

@AnantGulati (Contributor)

@wconstab @albanD
Could you please help with approval for this PR?
All comments have been addressed and the error seems to be unrelated.
Thanks

@weifengpy (Contributor)

XPU support landed recently: https://github.com/pytorch/pytorch/pull/147518/files

Could you rebase and resolve conflicts, if any?

@weifengpy (Contributor)

Hopefully the XPU PR makes your PR easier as well, since it already improved the utility functions.

@AnantGulati (Contributor)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/145521/head returned non-zero exit code 1

Rebasing (1/1)
Auto-merging test/distributed/fsdp/test_wrap.py
Auto-merging torch/testing/_internal/common_distributed.py
CONFLICT (content): Merge conflict in torch/testing/_internal/common_distributed.py
error: could not apply 422f307825a... General Changes for multi accelerators
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 422f307825a... General Changes for multi accelerators

Raised by https://github.com/pytorch/pytorch/actions/runs/13690560581


linux-foundation-easycla bot commented Mar 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: rahulsingh-intel / name: RAHUL SINGH (679b12c)

@AnantGulati (Contributor)

@weifengpy Rebase has been performed successfully.
Could you please help with the approval while we re-apply for the CLA?
Thanks

@rahulsingh-intel (Contributor Author)

@weifengpy Please check.

parametrize,
run_tests,
TEST_WITH_DEV_DBG_ASAN,
)


device_type = torch.device(get_devtype())
Contributor:

What does get_devtype do? It seems unlikely, but what if there are two device types in a machine?

Contributor:

get_devtype returns a device type object corresponding to the detected device, which we obtain via an if-else ladder defined in common_fsdp:

if TEST_CUDA:
    DEVICE_TYPE = "cuda"
    DISTRIBUTED_BACKEND = "nccl"
    DEVICE_COUNT = torch.cuda.device_count()
elif TEST_HPU:
    DEVICE_TYPE = "hpu:0"
    DISTRIBUTED_BACKEND = "hccl"
elif TEST_XPU:
    DEVICE_TYPE = "xpu"
    DISTRIBUTED_BACKEND = "xccl"
    DEVICE_COUNT = torch.xpu.device_count()
else:
    DEVICE_TYPE = "cpu"
    DISTRIBUTED_BACKEND = "gloo"
    DEVICE_COUNT = 1

At the moment this methodology does not support more than one device type on a given system.

instantiate_parametrized_tests(TestFSDPIgnoredModules)

devices = ("cuda", "hpu")
instantiate_device_type_tests(TestFSDPIgnoredModules, globals(), only_for=devices)
Contributor:

Does this mean the test process will always instantiate double the tests (cuda, hpu) and then at runtime decide to skip the tests that don't match the current device? Wondering if this adds meaningful overhead if we see more vendor backends in the devices list.

Contributor:

No. instantiate_device_type_tests will instantiate tests only for the devices detected.
For example, if the conditional were instead
devices = ("cpu", "cuda", "hpu")
then on a setup that supports CUDA, two test cases would be generated for each test case within the class: test_case_1_cpu and test_case_1_cuda.

In the case implemented here, only one test case, test_case_1_cuda, would be generated.
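As an illustration, a minimal sketch of this instantiation pattern; the TestExample class and its test body are hypothetical, and only_for filters to backends actually detected on the machine:

import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExample(TestCase):
    def test_add(self, device):
        # `device` is injected per detected backend, e.g. "cuda:0".
        x = torch.ones(2, device=device)
        self.assertEqual((x + x).sum().item(), 4.0)

devices = ("cuda", "hpu")
# Generates e.g. TestExampleCUDA only when that backend is present; no
# per-device classes are created for backends that are absent.
instantiate_device_type_tests(TestExample, globals(), only_for=devices)

if __name__ == "__main__":
    run_tests()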

@guangyey (Collaborator)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased common_fsdp onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout common_fsdp && git pull --rebase)

@rahulsingh-intel (Contributor Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased common_fsdp onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout common_fsdp && git pull --rebase)

@AnantGulati (Contributor)

@wconstab @albanD The errors seem to be unrelated. Could you please help with this approval? All comments have been addressed.

Thanks

)


device_type = torch.device(get_devtype())
Contributor:

torch.accelerator is designed for code generalization; would it be better to use torch.accelerator.current_accelerator().type to get device_type?

Contributor:

torch.accelerator does not support HPU as of now, hence we have used the approach demonstrated above, as it allows for easy addition of device types.
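For reference, a hedged sketch of the suggested torch.accelerator alternative (assuming a PyTorch build where the API is available; it returns None when no supported accelerator is present):

import torch

# torch.accelerator.current_accelerator() returns a torch.device for the
# detected accelerator backend (cuda, xpu, mps, ...) or None if none is found.
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"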

@AnantGulati (Contributor)

@wconstab @kwen2501 @albanD @daisyden
Could you please help with approval for this PR? All comments have been addressed and the CI failures seem to be unrelated.
Thanks

@albanD (Collaborator) commented Apr 14, 2025

The CI failures are relevant I think :)


Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the Stale label on Jun 13, 2025
Labels
oncall: distributed, open source, release notes: distributed (fsdp), Stale, triaged