General Changes for multi accelerators #145521
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145521
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures as of commit 679b12c with merge base 2a9e737.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @kwen2501, please review the changes. Thanks.
torch.cuda.nccl.version() < version,
f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
)
return lambda func: func
Can you explain this change? What does `return lambda func: func` do?
What is the desired behavior here? If TEST_CUDA is not enabled, do you want this macro to skip_but_pass, or do you want the test to run? A bit more description of the intended behavior in the PR description would be helpful.
The requires_nccl decorator is only used in CUDA environments, so it is skipped in non-CUDA environments via `lambda func: func`, which returns a no-op decorator when TEST_CUDA is False.
The change is made here, in the common place, instead of modifying every test module where the decorator is used.
Thanks.
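For reference, a minimal sketch of the pattern being described (the helper shape and the skip condition below are simplified assumptions, not the exact PyTorch implementation):

```python
import unittest

import torch

TEST_CUDA = torch.cuda.is_available()


def requires_nccl():
    if not TEST_CUDA:
        # No-op decorator: the decorated test function is returned unchanged,
        # so the test still runs on non-CUDA platforms.
        return lambda func: func
    # On CUDA platforms, keep an NCCL gate (sketched here in simplified form).
    return unittest.skipUnless(
        torch.cuda.nccl.version() is not None,
        "Requires NCCL",
    )
```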
Returning a no-op decorator could make other platforms run a test that requires NCCL and cause a failure; it breaks the logic of requires_nccl().
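As a hedged sketch of the alternative implied here, the decorator could skip outright when the NCCL backend is unavailable, preserving the requirement on other platforms (torch.distributed.is_nccl_available() is used purely for illustration):

```python
import unittest

import torch.distributed as dist


def requires_nccl():
    # Skip, rather than silently run, tests that genuinely need NCCL.
    return unittest.skipUnless(dist.is_nccl_available(), "Test requires NCCL")
```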
device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace this with torch.get_device_module(device_type.type).device_count() (see line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
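A minimal sketch of the suggested replacement (the device selection below is an illustrative assumption; the same substitution applies to the other _get_device_module call sites flagged in this review):

```python
import torch

# Pick an accelerator if one is available, otherwise fall back to CPU.
device_type = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# torch.get_device_module returns the torch.<backend> module for the device,
# e.g. torch.cuda for "cuda", so device_count() works across backends.
DEVICE_COUNT = torch.get_device_module(device_type.type).device_count()
print(device_type, DEVICE_COUNT)
```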
parametrize,
run_tests,
TEST_WITH_DEV_DBG_ASAN,
)

device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace this API with torch.get_device_module (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
TEST_WITH_DEV_DBG_ASAN,
)

device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace with torch.get_device_module (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
class TestFSDPStateDict(FSDPTest):
    @property
    def world_size(self):
-        return min(torch.cuda.device_count(), 2)
+        return min(_get_device_module(device_type.type).device_count(), 2)
Replace with torch.get_device_module (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
i.e. torch.get_device_module(device_type).device_count()
@@ -1272,7 +1277,7 @@ def test_world_size_one(self):
 class TestFSDPStateDict4GPUs(FSDPTest):
     @property
     def world_size(self):
-        return torch.cuda.device_count()
+        return max(_get_device_module(device_type.type).device_count(), 2)
Replace with torch.get_device_module(device_type.type).device_count() (line 2683 in c0ec2e0):
def get_device_module(device: _Optional[_Union[torch.device, str]] = None):
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    skip_if_lt_x_gpu,
    with_comms,
)

device_type = torch.device(get_devtype())

DEVICE_COUNT = _get_device_module(device_type.type).device_count()
Replace with torch.get_device_module(device_type.type).device_count().
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/13164618168
Force-pushed from c6a00a9 to 695c837.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from 695c837 to 422f307.
Hi @kwen2501, please review.
@albanD @wconstab @kwen2501 @EikanWang
@zhangxiaoli73, could you help check whether your FSDP-related PR has covered some of this PR?
My PR leverages the generalized tests from Anant to enhance XCCL backend support on XPU, so there is no conflict or duplication now.
"Try to land this since the failure is unrelated." |
This PR needs to be approved by an authorized maintainer before merge. |
@weifengpy Could you please review this PR |
There is XPU support landed recently (https://github.com/pytorch/pytorch/pull/147518/files); could you rebase and resolve conflicts, if any?
Hopefully the XPU PR makes your PR easier as well, since they already improved the utils functions.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/13690560581
@weifengpy Rebase has been performed successfully.
Force-pushed from 61a594d to 8da6c6f.
@weifengpy Please check.
parametrize,
run_tests,
TEST_WITH_DEV_DBG_ASAN,
)

device_type = torch.device(get_devtype())
What does get_devtype do? It seems unlikely, but what if there are two device types in a machine?
get_devtype returns a torch.device type corresponding to the device we detect via an if-else ladder defined in common_fsdp:

if TEST_CUDA:
    DEVICE_TYPE = "cuda"
    DISTRIBUTED_BACKEND = "nccl"
    DEVICE_COUNT = torch.cuda.device_count()
elif TEST_HPU:
    DEVICE_TYPE = "hpu:0"
    DISTRIBUTED_BACKEND = "hccl"
elif TEST_XPU:
    DEVICE_TYPE = "xpu"
    DISTRIBUTED_BACKEND = "xccl"
    DEVICE_COUNT = torch.xpu.device_count()
else:
    DEVICE_TYPE = "cpu"
    DISTRIBUTED_BACKEND = "gloo"
    DEVICE_COUNT = 1

At the moment this methodology does not support more than one device type on a given system.
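As a hedged illustration, the same ladder could be wrapped in a helper of roughly this shape (the availability checks and return type here are assumptions; the actual common_fsdp implementation may differ):

```python
import torch


def get_devtype() -> torch.device:
    # Return the first detected accelerator type, falling back to CPU.
    # Only a single device type per machine is handled.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "hpu") and torch.hpu.is_available():
        return torch.device("hpu")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")


device_type = torch.device(get_devtype())
```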
instantiate_parametrized_tests(TestFSDPIgnoredModules)

devices = ("cuda", "hpu")
instantiate_device_type_tests(TestFSDPIgnoredModules, globals(), only_for=devices)
Does this mean the test process will always instantiate double the tests (cuda, hpu) and then at runtime decide to skip the tests that don't match the current device? Wondering if this adds meaningful overhead as we see more vendor backends in the devices list.
No. instantiate_device_type_tests will instantiate tests only for the device detected.
For example, if the conditional instead were
devices = ("cpu", "cuda", "hpu")
then on a setup that supports CUDA, two test cases would be generated for each test case within the class: test_case_1_cpu and test_case_1_cuda.
In the case implemented here, only one test case, test_case_1_cuda, would be generated.
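A hedged usage sketch of that behavior (the test class and body are illustrative only; only_for controls which device variants are generated):

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class TestExample(TestCase):
    def test_case_1(self, device):
        # `device` is filled in per generated variant, e.g. "cuda:0".
        x = torch.ones(2, device=device)
        self.assertEqual(x.sum().item(), 2.0)


# Generates the CUDA variant (test_case_1_cuda) only when a CUDA device is
# detected; no HPU variant is created unless HPU is the detected device type.
instantiate_device_type_tests(TestExample, globals(), only_for=("cuda", "hpu"))

if __name__ == "__main__":
    run_tests()
```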
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from 8da6c6f to a076a74.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from a076a74 to 679b12c.
)

device_type = torch.device(get_devtype())
torch.accelerator is designed for code generalization; shouldn't we use torch.accelerator.current_accelerator().type to get device_type?
torch.accelerator does not support HPU as of now, hence we have used the approach demonstrated above, as it allows for easy addition of device types.
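For reference, a hedged sketch of the torch.accelerator route with a fallback for backends such as HPU that it does not yet cover (the fallback helper below is an assumption for illustration, not part of this PR):

```python
import torch


def detect_device_type() -> str:
    # Prefer the generic accelerator API when an accelerator is registered.
    if torch.accelerator.is_available():
        return torch.accelerator.current_accelerator().type
    # Fall back to explicit checks for backends outside torch.accelerator.
    if hasattr(torch, "hpu") and torch.hpu.is_available():
        return "hpu"
    return "cpu"


device_type = torch.device(detect_device_type())
```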
The CI failures are relevant, I think :)
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as
Intend to generalize the framework for multiple accelerators.
Major changes include:
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o