Upgrade dependencies #2998

@hanwen-cluster commented Jul 31, 2025

Description of changes

  • Upgrade Slurm to version 24.11.6 (from 24.05.8).
  • Upgrade EFA installer to 1.42.0 (from 1.41.0).
    • Efa-driver: efa-2.15.3-1
    • Efa-config: efa-config-1.18-1
    • Efa-profile: efa-profile-1.7-1
    • Libfabric-aws: libfabric-aws-2.1.0-3
    • Rdma-core: rdma-core-57.0-1
    • Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
  • Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
  • Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
  • Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
  • Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSs except AL2.
  • Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
  • Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).

Among the upgrades above, DCGM is a major version upgrade (from version 3 to version 4). The DCGM 4 changelog describes the following packaging change:

Installation assets are no longer shipped in a single monolithic package. Instead, installation assets have been split among several packages, allowing clients to opt-out of the installation of assets not applicable to their use case.

Component packages are as follows:

  • datacenter-gpu-manager-4-core: Provides nv-hostengine binary and other CUDA-agnostic installation assets available through the DCGM open source product
  • datacenter-gpu-manager-4-cuda11: Provides the CUDA11-specific binaries available through the DCGM open source product
  • datacenter-gpu-manager-4-cuda12: Provides the CUDA12-specific binaries available through the DCGM open source product
  • datacenter-gpu-manager-4-proprietary: Provides CUDA-agnostic installation assets not distributed as part of the DCGM open source product
  • datacenter-gpu-manager-4-proprietary-cuda11: Provides CUDA11 binaries not distributed as part of the DCGM open source product
  • datacenter-gpu-manager-4-proprietary-cuda12: Provides CUDA12 binaries not distributed as part of the DCGM open source product
  • datacenter-gpu-manager-4-development: Provides files necessary for the development of downstream software dependent on the DCGM library

For the ParallelCluster GPU health check use case, datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-cuda12 are the minimal set of packages we need to install. I verified this by running the GPU health check manually on a GPU instance: omitting either package causes errors.
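
For reviewers, the package selection can be sketched roughly as below. This is only an illustration, not code from this PR: `dcgm4_minimal_packages` is a hypothetical helper, and it assumes the `-cuda<major>` suffix should match the major version of the installed CUDA Toolkit (12 here, since this PR ships CUDA 12.8.1).

```python
# Hypothetical helper, not part of the cookbook: builds the minimal DCGM 4
# package list from the package naming scheme quoted in the changelog.
DCGM4_PREFIX = "datacenter-gpu-manager-4"
# Only the CUDA variants named in the quoted changelog are accepted.
SUPPORTED_CUDA_MAJORS = (11, 12)


def dcgm4_minimal_packages(cuda_major: int) -> list[str]:
    """Return the core package plus the matching CUDA-specific package."""
    if cuda_major not in SUPPORTED_CUDA_MAJORS:
        raise ValueError(f"no DCGM 4 package listed for CUDA {cuda_major}")
    return [f"{DCGM4_PREFIX}-core", f"{DCGM4_PREFIX}-cuda{cuda_major}"]


if __name__ == "__main__":
    # Assumption: CUDA Toolkit 12.8.1 maps to the cuda12 package variant.
    print(" ".join(dcgm4_minimal_packages(12)))
```

The proprietary and development packages are deliberately left out, since the manual health check run did not need them.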

https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html

Tests

  • Image builds have been tested on all OSes except Rocky. We will test Rocky after the PR is merged.

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Hanwen <[email protected]>

gmarciani previously approved these changes Aug 7, 2025
@gmarciani enabled auto-merge (rebase) August 7, 2025 13:45
auto-merge was automatically disabled August 7, 2025 14:35

Rebase failed

@gmarciani

Closed in favor of #3000

@gmarciani gmarciani closed this Aug 7, 2025