General Changes for multi accelerators #145521
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145521
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures as of commit 679b12c with merge base 2a9e737.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @kwen2501, please review the changes. Thanks.
torch.cuda.nccl.version() < version,
f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
)
return lambda func: func
Can you explain this change? What does `return lambda func: func` do?
What is the desired behavior here? If TEST_CUDA is not enabled, do you want this macro to skip_but_pass, or do you want the test to run? A bit more description of the intended behavior in the PR description would be helpful.
The requires_nccl decorator is only used in CUDA environments, so it is skipped in non-CUDA environments via `lambda func: func`, which returns a no-op decorator when TEST_CUDA is False.
The change is made here, in the common place, instead of modifying every test module where the decorator is used.
Thanks.
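For reference, a minimal sketch of the pattern being described (the helper shape and the skip condition below are simplified assumptions, not the exact PyTorch implementation):

```python
import unittest

import torch

TEST_CUDA = torch.cuda.is_available()


def requires_nccl():
    if not TEST_CUDA:
        # No-op decorator: the decorated test function is returned unchanged,
        # so the test still runs on non-CUDA platforms.
        return lambda func: func
    # On CUDA platforms, keep an NCCL gate (sketched here in simplified form).
    return unittest.skipUnless(
        torch.cuda.nccl.version() is not None,
        "Requires NCCL",
    )
```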
Returning a no-op decorator could make other platforms run a test that requires NCCL and cause a failure; it breaks the logic of requires_nccl().
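As a hedged sketch of the alternative implied here, the decorator could skip outright when the NCCL backend is unavailable, preserving the requirement on other platforms (torch.distributed.is_nccl_available() is used purely for illustration):

```python
import unittest

import torch.distributed as dist


def requires_nccl():
    # Skip, rather than silently run, tests that genuinely need NCCL.
    return unittest.skipUnless(dist.is_nccl_available(), "Test requires NCCL")
```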
device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace this with torch.get_device_module(device_type.type).device_count() (see line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
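A minimal sketch of the suggested replacement (the device selection below is an illustrative assumption; the same substitution applies to the other _get_device_module call sites flagged in this review):

```python
import torch

# Pick an accelerator if one is available, otherwise fall back to CPU.
device_type = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# torch.get_device_module returns the torch.<backend> module for the device,
# e.g. torch.cuda for "cuda", so device_count() works across backends.
DEVICE_COUNT = torch.get_device_module(device_type.type).device_count()
print(device_type, DEVICE_COUNT)
```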
parametrize,
run_tests,
TEST_WITH_DEV_DBG_ASAN,
)

device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace this API with torch.get_device_module (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
TEST_WITH_DEV_DBG_ASAN,
)

device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace with torch.get_device_module (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
class TestFSDPStateDict(FSDPTest):
    @property
    def world_size(self):
-        return min(torch.cuda.device_count(), 2)
+        return min(_get_device_module(device_type.type).device_count(), 2)
Replace with torch.get_device_module (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
i.e. torch.get_device_module(device_type).device_count()
@@ -1272,7 +1277,7 @@ def test_world_size_one(self):
 class TestFSDPStateDict4GPUs(FSDPTest):
     @property
     def world_size(self):
-        return torch.cuda.device_count()
+        return max(_get_device_module(device_type.type).device_count(), 2)
Replace with torch.get_device_module(device_type.type).device_count() (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    skip_if_lt_x_gpu,
    with_comms,
)

device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace with torch.get_device_module(device_type.type).device_count().
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/13164618168
Force-pushed from c6a00a9 to 695c837.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from 695c837 to 422f307.
Hi @kwen2501, please review.
@albanD @wconstab @kwen2501 @EikanWang
@zhangxiaoli73, could you help check whether your FSDP-related PR has covered some of this PR?
My PR leverages the generalized tests from Anant to enhance XCCL backend support on XPU, so there is no conflict or duplication now.
"Try to land this since the failure is unrelated." |
This PR needs to be approved by an authorized maintainer before merge. |
@weifengpy Could you please review this PR |
There is XPU support landed recently (https://github.com/pytorch/pytorch/pull/147518/files); could you rebase and resolve conflicts, if any?
Hopefully the XPU PR makes your PR easier as well, since they already improved the utils functions.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/13690560581
@weifengpy Rebase has been performed successfully.
Force-pushed from 61a594d to 8da6c6f.
@weifengpy Please check.
parametrize,
run_tests,
TEST_WITH_DEV_DBG_ASAN,
)

device_type = torch.device(get_devtype())
What does get_devtype do? It seems unlikely, but what if there are two device types in a machine?
get_devtype returns a torch.device type corresponding to the device we detect via an if-else ladder defined in common_fsdp:

if TEST_CUDA:
    DEVICE_TYPE = "cuda"
    DISTRIBUTED_BACKEND = "nccl"
    DEVICE_COUNT = torch.cuda.device_count()
elif TEST_HPU:
    DEVICE_TYPE = "hpu:0"
    DISTRIBUTED_BACKEND = "hccl"
elif TEST_XPU:
    DEVICE_TYPE = "xpu"
    DISTRIBUTED_BACKEND = "xccl"
    DEVICE_COUNT = torch.xpu.device_count()
else:
    DEVICE_TYPE = "cpu"
    DISTRIBUTED_BACKEND = "gloo"
    DEVICE_COUNT = 1

At the moment this methodology does not support more than one device type on a given system.
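As a hedged illustration, the same ladder could be wrapped in a helper of roughly this shape (the availability checks and return type here are assumptions; the actual common_fsdp implementation may differ):

```python
import torch


def get_devtype() -> torch.device:
    # Return the first detected accelerator type, falling back to CPU.
    # Only a single device type per machine is handled.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "hpu") and torch.hpu.is_available():
        return torch.device("hpu")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")


device_type = torch.device(get_devtype())
```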
instantiate_parametrized_tests(TestFSDPIgnoredModules)

devices = ("cuda", "hpu")
instantiate_device_type_tests(TestFSDPIgnoredModules, globals(), only_for=devices)
Does this mean the test process will always instantiate double the tests (cuda, hpu) and then at runtime decide to skip the tests that don't match the current device? Wondering if this adds meaningful overhead as we see more vendor backends in the devices list.
No. instantiate_device_type_tests will instantiate tests only for the device detected.
For example, if the conditional instead were
devices = ("cpu", "cuda", "hpu")
then on a setup that supports CUDA, two test cases would be generated for each test case within the class: test_case_1_cpu and test_case_1_cuda.
In the case implemented here, only one test case, test_case_1_cuda, would be generated.
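A hedged usage sketch of that behavior (the test class and body are illustrative only; only_for controls which device variants are generated):

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class TestExample(TestCase):
    def test_case_1(self, device):
        # `device` is filled in per generated variant, e.g. "cuda:0".
        x = torch.ones(2, device=device)
        self.assertEqual(x.sum().item(), 2.0)


# Generates the CUDA variant (test_case_1_cuda) only when a CUDA device is
# detected; no HPU variant is created unless HPU is the detected device type.
instantiate_device_type_tests(TestExample, globals(), only_for=("cuda", "hpu"))

if __name__ == "__main__":
    run_tests()
```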
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from 8da6c6f to a076a74.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from a076a74 to 679b12c.
)

device_type = torch.device(get_devtype())
torch.accelerator is designed for code generalization; shouldn't we use torch.accelerator.current_accelerator().type to get device_type?
torch.accelerator does not support HPU as of now, hence we have used the approach demonstrated above, as it allows for easy addition of device types.
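For reference, a hedged sketch of the torch.accelerator route with a fallback for backends such as HPU that it does not yet cover (the fallback helper below is an assumption for illustration, not part of this PR):

```python
import torch


def detect_device_type() -> str:
    # Prefer the generic accelerator API when an accelerator is registered.
    if torch.accelerator.is_available():
        return torch.accelerator.current_accelerator().type
    # Fall back to explicit checks for backends outside torch.accelerator.
    if hasattr(torch, "hpu") and torch.hpu.is_available():
        return "hpu"
    return "cpu"


device_type = torch.device(detect_device_type())
```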
The CI failures are relevant, I think :)
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as
Intend to generalize the framework for multiple accelerators.
Major changes include:
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o