Fixed autoparallel integration tests on ROCm. #2321

Merged: tianyu-l merged 1 commit into pytorch:main from ROCm:FIX/fix_autoparallel_test_rocm on Feb 17, 2026

@wenchenvincent wenchenvincent commented Feb 4, 2026

Collaborator

This PR fixes the autoparallel integration tests on ROCm.

Background:
The HIP runtime relies on the file `/opt/amdgpu/share/libdrm/amdgpu.ids` to look up the product name of the AMD GPU, but this file is missing in the CI docker container. Only `/usr/share/libdrm/amdgpu.ids` exists in the container, and it is out of date: it does not include newer products like MI300. One way to address this is to pass `-v /opt/amdgpu:/opt/amdgpu:ro` when launching the docker container, which maps `/opt/amdgpu/share/libdrm/amdgpu.ids` from the host into the container. That, however, requires changes to the workflow files in pytorch/test_infra and pytorch/pytorch. This quick fix works around the issue by updating the amdgpu.ids file within the CI docker container.
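
For anyone reproducing this locally, the check described above can be sketched roughly as follows. This is a minimal sketch, not what the HIP runtime itself does: the helper names `ids_file_mentions` and `report` are hypothetical, and the match is a plain case-insensitive substring search over non-comment lines, assuming that `#` starts a comment line in `amdgpu.ids`.

```python
from pathlib import Path

# The two locations named in the PR description.
CANDIDATE_PATHS = (
    "/opt/amdgpu/share/libdrm/amdgpu.ids",
    "/usr/share/libdrm/amdgpu.ids",
)


def ids_file_mentions(path, product="MI300"):
    """Return None if `path` is missing, else whether any entry mentions `product`."""
    ids_file = Path(path)
    if not ids_file.is_file():
        return None
    needle = product.lower()
    with ids_file.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Assumption: lines starting with '#' are comments, not entries.
            if line.lstrip().startswith("#"):
                continue
            if needle in line.lower():
                return True
    return False


def report(paths=CANDIDATE_PATHS, product="MI300"):
    """Map each candidate path to None (missing), True, or False."""
    return {path: ids_file_mentions(path, product) for path in paths}


if __name__ == "__main__":
    labels = {None: "missing", True: "has a matching entry", False: "no matching entry"}
    for path, status in report().items():
        print(f"{path}: {labels[status]}")
```

Run inside the CI container, a result of "missing" for the first path and "no matching entry" for the second would be consistent with the situation described above.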

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 4, 2026
@wenchenvincent wenchenvincent marked this pull request as draft February 4, 2026 05:39
@wenchenvincent wenchenvincent marked this pull request as ready for review February 5, 2026 01:05

@tianyu-l tianyu-l left a comment

Contributor

> HIP runtime relies on the file /opt/amdgpu/share/libdrm/amdgpu.ids to look up the product name of AMDGPU. But this file is missing in the CI docker container.

I don't think we have the urgency to enable autoparallel CI on rocm. I don't mind waiting until a proper fix (in docker, pytorch, or test_infra) is available, compared with landing this workaround.

pytorch-bot Bot commented Feb 16, 2026

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

@wenchenvincent

Collaborator Author

> HIP runtime relies on the file /opt/amdgpu/share/libdrm/amdgpu.ids to look up the product name of AMDGPU. But this file is missing in the CI docker container.
>
> I don't think we have the urgency to enable autoparallel CI on rocm. I don't mind waiting until a proper fix (in docker, pytorch, or test_infra) is available, compared with landing this workaround.

Hi @tianyu-l. I talked to our PyTorch ROCm team about this issue. Using `-v /opt/amdgpu:/opt/amdgpu:ro` when launching the docker container isn't a good option either, because the proprietary and open-source amdgpu drivers use different paths. The PyTorch ROCm heavyweight wheels actually package their own copy of libdrm and the amdgpu.ids file to address this issue. However, that was silently broken by an update in libdrm. A recent PR in pytorch fixes this (pytorch/pytorch#174811), and I verified it with the latest nightly wheel (https://download.pytorch.org/whl/nightly/rocm7.1/torch-2.11.0.dev20260215%2Brocm7.1-cp312-cp312-manylinux_2_28_x86_64.whl). I updated this PR to just enable the autoparallel test, and the test should pass when CI is re-triggered.
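
One way to sanity-check that an installed wheel carries its own amdgpu.ids is sketched below. It is only an illustration under stated assumptions: the helper names `find_bundled_ids` and `torch_package_dir` are hypothetical, and since the location of the file inside the wheel is not stated here, the sketch simply searches the whole installed `torch` package directory rather than a specific path.

```python
import importlib.util
from pathlib import Path


def find_bundled_ids(root, filename="amdgpu.ids"):
    """Return the sorted paths of every file named `filename` under `root`."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(str(p) for p in root_path.rglob(filename) if p.is_file())


def torch_package_dir():
    """Locate the installed torch package directory without importing torch."""
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    return next(iter(spec.submodule_search_locations))


if __name__ == "__main__":
    package_dir = torch_package_dir()
    if package_dir is None:
        print("torch is not installed in this environment")
    else:
        matches = find_bundled_ids(package_dir)
        print("\n".join(matches) if matches else "no bundled amdgpu.ids found")
```

Finding a copy only shows that the file ships with the wheel; whether the runtime actually reads it still has to be confirmed by running the test itself.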

pytorch-bot Bot commented Feb 16, 2026

Warning: Unknown label ciflow/rocm.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

@tianyu-l tianyu-l merged commit 2ce22eb into pytorch:main Feb 17, 2026
24 checks passed
TXacs pushed a commit to McmillanTAC/torchtitan that referenced this pull request Apr 13, 2026
ACharacterInASimulation pushed a commit to ACharacterInASimulation/torchtitan that referenced this pull request Apr 21, 2026
2 participants