feat(ml): rocm #16613
Conversation
- suffix: ["", "-cuda", "-openvino", "-armnn"]
+ suffix: ['', '-cuda', '-rocm', '-openvino', '-armnn']
  steps:
    - name: Login to GitHub Container Registry
There are some changes in indentation as well as changes from double quotes to single quotes. Was this intended? I know it's from the first commit of the original PR, but I don't think that was addressed.
VS Code did this when I saved. I'm not sure why it's different
Is there a PR check that runs prettier on the workflow files? I would think the inconsistency exists because there likely isn't.
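For illustration, a check like that could look roughly like the following; this is a hypothetical workflow job sketch, not something that exists in this PR (job name, action versions, and glob are assumptions):

```yaml
# Hypothetical CI job: fail the PR if workflow files are not Prettier-formatted.
format-workflows:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Check workflow formatting
      run: npx prettier --check ".github/workflows/**/*.yml"
```

Running `prettier --write` locally on the same glob would then settle the quote-style and indentation differences automatically.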
zackpollard left a comment:
Nice! The Docker cache appears to be working with no changes. Would you mind changing something within ML itself that requires a source code change and rebuild, just so we can see the cache working in those cases before we merge?
FYI, there's a set of ROCm builds available supporting a wider range of AMD hardware, which might be useful:
"ROCM SDK Builder 6.1.2 is based on ROCM 6.1.2"
Sadly, no, not quite. Official ROCm does not support, for instance, gfx1103 (RX 780M and similar iGPUs, 7940HS and similar APUs).
The officially listed support in the docs is mostly just gfx103X and gfx110X, and maybe a few other targets. They're inconsistent: "supported" means their team will help you on GitHub with certain things, while anything not on the list may still work (e.g. Vega GPUs work fine) but without their help. Edit: So my question would be: how does one check what's supported by the build they are running?
Yeah, but the official ROCm build will not work with gfx1103 at all, applications built against it (i.e. pytorch prebuilt) will not work with gfx1103, and building against it for gfx1103 will not work either.
I'm not quite sure. On Fedora, the gfx1103 build is provided as a separate package and listed as a separate folder, but the officially supported gfx1102 falls under gfx1100 here, so it's not a reliable check:
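One rough way to compare what the runtime sees against what a build ships is to grep for `gfx` ISA names; this is a sketch that assumes `rocminfo`'s `Name: gfxNNNN` output format and the usual rocBLAS kernel directory layout, so verify the paths on your system:

```shell
# On a real system you would run:
#   rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u                            # ISAs the runtime reports
#   ls /opt/rocm/lib/rocblas/library/ | grep -o 'gfx[0-9a-f]*' | sort -u   # ISAs rocBLAS was built for
# Here the same pipeline runs on canned sample output so the parsing is visible.
sample='  Name:                    gfx1103
  Name:                    gfx1036'
printf '%s\n' "$sample" | grep -o 'gfx[0-9a-f]*' | sort -u
```

If an ISA shows up in `rocminfo` but not in the rocBLAS kernel directory, that build was not compiled for your GPU.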
Maybe it would be useful to have two rocm-flavored options? One with the current main ROCm version, and one with the community version built to support a wider variety of GPUs?
Nice, they split them up by version. Eventually we want to do that to cut down the 30 GB image size. Frigate also splits them up. The current image we build has multiple versions all built into one image.
Doing that would also resolve the "official or unofficial build?" question, I suppose, since you can provide the official builds for the supported GPUs and the unofficial builds for the unsupported GPUs. But you'd need to provide a lot of images that way. Edit: FYI:
machine-learning/Dockerfile (outdated)

  WORKDIR /code

  RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
Suggested change:
- RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
+ RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx-dev
Only migraphx-dev is needed, as the other two are pulled in as its dependencies.
Edit: don't change it now, though, because it's already building.
machine-learning/Dockerfile (outdated)

    /opt/ann/build.sh \
    /opt/armnn/

  FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS prod-rocm
I know there were already comments on this, but I think copying the deps manually may result in a smaller, yet still working image. It might be worth re-investigating.
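A sketch of what "copying the deps manually" could look like with a multi-stage build; the specific library files listed here are assumptions and would need to be verified (e.g. by running `ldd` against the built onnxruntime shared objects), not a tested recipe:

```dockerfile
# Sketch only: base the prod stage on plain Ubuntu instead of the full ROCm dev
# image, and copy just the runtime pieces onnxruntime needs out of the builder.
FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS builder-rocm
# ... build onnxruntime-rocm here ...

FROM ubuntu:22.04 AS prod-rocm
# Hypothetical minimal set; confirm with ldd which libraries are actually required.
COPY --from=builder-rocm /opt/rocm/lib/librocblas.so* /opt/rocm/lib/
COPY --from=builder-rocm /opt/rocm/lib/libMIOpen.so* /opt/rocm/lib/
COPY --from=builder-rocm /opt/rocm/lib/rocblas/library /opt/rocm/lib/rocblas/library
ENV LD_LIBRARY_PATH=/opt/rocm/lib
```

The trade-off is maintenance: every ROCm bump can change the library set, whereas the `-complete` base image always works but stays huge.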
machine-learning/Dockerfile (outdated)

  # Warning: 25GiB+ disk space required to pull this image
  # TODO: find a way to reduce the image size
  FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS builder-rocm
Nope. Not it.
The Fedora rocBLAS patch for gfx1103 support looks like a copy of gfx1102 (navi33). Only the names and ISA versions differ. I diffed a few of the files and think these are the only differences. I'm interested in additional GPU support because I have a mini PC with a Ryzen 8845HS (Radeon 780M) for testing, and a second one with a Ryzen 5825U.
My 780M locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any ROCm version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich, that does not bode well...
This is not a valid version from what I've observed. So far, there are only 3 valid options:
Unfortunately, adding support for gfx1102 doesn't solve the problems with crashing on the Radeon 780M, but I'm happy because I succeeded in getting it to work on the Ryzen 5825U GPU.
They also specifically say certain iGPUs crash. I would bet that they're just bleeding edge.
That model or a similar one is known to work.
The ROCm in the image created in this PR is compiled for the archs listed below, so 11.0.2 is a valid option because it means gfx1102. Below is a directory listing from the image. Without the patch to onnxruntime, HSA_OVERRIDE_GFX_VERSION=9.0.0 isn't a valid option in immich-machine-learning because that arch isn't compiled by default.
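For reference, the override discussed here is set as an environment variable on the ML container; a minimal compose sketch (the value must correspond to an ISA your build actually ships, e.g. 11.0.2 -> gfx1102):

```yaml
# docker-compose override sketch: make the ROCm runtime treat the GPU as a
# compiled-for ISA. Only use a version your ROCm build was compiled for.
services:
  immich-machine-learning:
    environment:
      HSA_OVERRIDE_GFX_VERSION: "11.0.2"
```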
1. If you do not already have it, download the latest [`hwaccel.ml.yml`][hw-file] file and ensure it's in the same folder as the `docker-compose.yml`.
2. In the `docker-compose.yml` under `immich-machine-learning`, uncomment the `extends` section and change `cpu` to the appropriate backend.
- 3. Still in `immich-machine-learning`, add one of -[armnn, cuda, openvino] to the `image` section's tag at the end of the line.
+ 3. Still in `immich-machine-learning`, add one of -[armnn, cuda, rocm, openvino] to the `image` section's tag at the end of the line.
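Applying those documented steps for the ROCm backend would give a compose entry roughly like this sketch (the version tag is illustrative, and the `service` name is assumed to match the backend name in `hwaccel.ml.yml`):

```yaml
# Sketch of the documented steps applied for the ROCm backend.
services:
  immich-machine-learning:
    image: ghcr.io/immich-app/immich-machine-learning:release-rocm
    extends:
      file: hwaccel.ml.yml
      service: rocm
```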
did we forget to add rknn here? oops
* feat(ml): introduce support of onnxruntime-rocm for AMD GPU
* try mutex for algo cache: use OrtMutex
* bump versions, run on mich; use 3.12; use 1.19.2
* acquire lock before any changes can be made; guard algo benchmark results; mark mutex as mutable; re-add /bin/sh (?); use 3.10; use 6.1.2
* use composite cache key; 1.19.2; fix variable name; fix variable reference; aaaaaaaaaaaaaaaaaaaa
* bump deps
* disable algo caching
* fix gha
* try ubuntu runner
* actually fix the gha
* update patch
* skip mimalloc preload for rocm
* increase build threads
* increase timeout for rocm
* Revert "increase timeout for rocm" (reverts commit 2c4452f)
* attempt migraphx
* set migraphx_home
* Revert "set migraphx_home" (reverts commit c121d3e)
* Revert "attempt migraphx" (reverts commit 521f9fb)
* migraphx, take two
* bump rocm
* allow cpu
* try only targeting migraphx
* skip tests
* migraph ❌
* known issues
* target gfx900 and gfx1102
* mention `HSA_USE_SVM`
* update lock
* set device id for rocm

Co-authored-by: Mehdi GHESH <[email protected]>
Hi @przemekbialek, you wrote
I'm using the Ryzen PRO 8845HS with Radeon 780M (gfx1103). Unfortunately, it doesn't work for me using immich-machine-learning:v1.141.1-rocm, and the process fails with:
Without the override, I get
@NicholasFlamy would it be possible to upgrade the base image to
Regarding the ROCm update:
It's absolutely possible. Edit: I made a PR; it'll build an image that you can try out to see if it works. #21924 I'll have to test it before this can be merged.
So, to try the image, you can replace the ML image with this one:
Thanks a lot! Unfortunately, it doesn't make a difference for me. No matter what override value I use, I keep getting a
Are you doing the GFX override? If not, make sure to try it. If it still doesn't work, make a new issue and @ me in it.

Description
This PR introduces support for AMD GPUs through ROCm. It's a rebased version of #11063 with updated dependencies.
It also once again removes algo caching, as the concurrency issue with caching seems to be more subtle than originally thought. While disabling caching is wasteful (it essentially runs a benchmark every time instead of only once), it's still better than the current alternative of either lowering concurrency to 1 or not having ROCm support.
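The commit list above mentions targeting MIGraphX and setting the device id for ROCm. As a hedged sketch of how that typically looks when constructing an ONNX Runtime session (provider names and option keys follow onnxruntime's execution-provider interface; the model path is a placeholder, not Immich's actual code):

```python
# Sketch: requesting AMD GPU execution providers in ONNX Runtime, with CPU
# fallback. The tuples pair a provider name with its options dict.
providers = [
    ("MIGraphXExecutionProvider", {"device_id": 0}),
    ("ROCMExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",  # used when no AMD GPU is usable
]

# With onnxruntime-rocm installed, the session would be created like:
# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx", providers=providers)

print(len(providers))
```

Providers earlier in the list take priority, so MIGraphX is tried first and execution falls back down the list per-operator.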