Fixed autoparallel integration tests on ROCm. #2321
tianyu-l
left a comment
The HIP runtime relies on the file /opt/amdgpu/share/libdrm/amdgpu.ids to look up the product name of the AMD GPU, but this file is missing from the CI docker container.
I don't think enabling autoparallel CI on ROCm is urgent. I would rather wait until a proper fix (in docker, pytorch, or test_infra) is available than land this workaround.
02c7571 to 13daade
Warning: Unknown label
Please add the new label to .github/pytorch-probot.yml
Hi @tianyu-l. I talked to our PyTorch ROCm team about this issue. Using
13daade to 55dc1c1
This PR fixes the autoparallel integration tests on ROCm.
Background:
The HIP runtime relies on the file `/opt/amdgpu/share/libdrm/amdgpu.ids` to look up the product name of the AMD GPU, but this file is missing from the CI docker container. Only `/usr/share/libdrm/amdgpu.ids` exists in the container, and it is out of date: it does not include newer products such as MI300. One way to address this is to pass `-v /opt/amdgpu:/opt/amdgpu:ro` when launching the docker container, which maps the host's `/opt/amdgpu/share/libdrm/amdgpu.ids` into the container. That, however, requires changes to the workflow files in pytorch/test_infra and pytorch/pytorch. This quick fix works around the issue by updating the `amdgpu.ids` file within the CI docker container.
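For anyone who wants to check a container for the condition described above, here is a minimal diagnostic sketch. It is not part of this PR's change. The function name `check_amdgpu_ids` is hypothetical, and searching for the plain substring `MI300` is an assumption about how the product name appears in the file, so treat the result as a rough hint only.

```python
from pathlib import Path

# The two locations named in the PR description.
CANDIDATE_PATHS = (
    "/opt/amdgpu/share/libdrm/amdgpu.ids",
    "/usr/share/libdrm/amdgpu.ids",
)


def check_amdgpu_ids(paths=CANDIDATE_PATHS, product="MI300"):
    """Report, per path, whether the file exists and mentions `product`.

    Hypothetical helper: the substring match is an assumption, not the
    lookup the HIP runtime itself performs.
    """
    report = {}
    for p in paths:
        path = Path(p)
        if not path.is_file():
            report[p] = "missing"
            continue
        # Tolerate unexpected bytes instead of failing on decode errors.
        text = path.read_text(errors="replace")
        report[p] = ("has " if product in text else "lacks ") + product
    return report


if __name__ == "__main__":
    for path, status in check_amdgpu_ids().items():
        print(f"{path}: {status}")
```

Running it inside the CI container and again on the host should make the difference between the two environments visible, if the description above holds.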