-
Notifications
You must be signed in to change notification settings - Fork 24.6k
Refactoring Distributed test cases to be device agnostic [1/n] #145222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring Distributed test cases to be device agnostic [1/n] #145222
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145222
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 2 New Failures, 2 Unrelated FailuresAs of commit 908427d with merge base 106acf0 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@kwen2501 could you please help land this PR |
The most recent commit changes the way cuda is set in DIstributedTestBase. Here we explicitly pass |
Changing the device setting logic from using |
@kwen2501 Could you please help land this PR |
"Try to land this since the failure is unrelated." |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2) Details for Dev Infra teamRaised by workflow job |
"Try to land this since the failure is unrelated." |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2) Details for Dev Infra teamRaised by workflow job |
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 4 checks: pull / linux-focal-py3.13-clang10 / test (dynamo_wrapped, 2, 3, lf.linux.2xlarge), trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ch#145222) In this series of PR we intend to refactoring distributed test cases to enable to be completely device agnostic. These changes will include the following approaches to do the same : - Allowing for multiple device types using instantiate_device_type_test - Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies - Skipping set up steps required while using MultiProcessTestCase with DistributedTestBase (pytorch#138216) wherever applicable - Replacing explicit calls to distributed backend (NCCL,HCCL,etc) with get_default_backend_for_device (pytorch#140536). This should result in significant improvement in usability for all devices Pull Request resolved: pytorch#145222 Approved by: https://github.com/kwen2501
In this series of PR we intend to refactoring distributed test cases to enable to be completely device agnostic.
These changes will include the following approaches to do the same :
This should result in significant improvement in usability for all devices
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @mcarilli @ptrblck @leslie-fang-intel @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @xmfan