
Conversation

@mdboom (Contributor) commented Dec 17, 2025

(Marked as draft as a reminder to not merge until after the 0.5.0 release...)

Prerequisites to get this PR to pass:

This is the first landing of cuda.core.system, with all of the features in the nvutil prototype (which covers a somewhat arbitrary collection of the most essential NVML features, but is a reasonable starting point for a first PR).

This requires a generator change (not yet merged) to include the AUTO_LOWPP_* classes in the .pxd file so they can be cimport'ed. I know we don't usually do that, but it seems important to be able to reuse those high-level bindings rather than repeat ourselves. ABI stability there should be fine -- I don't anticipate needing to change anything on the .pxd side of those classes.

Following the nvutil design, this initializes NVML immediately upon import of cuda.core.system. That feels convenient and may be the right choice, but it will be hard to walk back. Questions the NVML docs don't answer for me: are there any use cases where you would want to init/shut down NVML repeatedly? (The cuda.bindings.nvml tests do this, so I know it works.) Is there any harm in init'ing and never shutting down? We could add an atexit handler, but I don't know whether one is required.
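
For concreteness, here's a minimal sketch of the kind of idempotent init plus atexit shutdown being considered. It is not what this PR does verbatim, and it uses pynvml purely for illustration since the PR's actual entry points live in cuda.bindings:

```python
# Minimal sketch only -- idempotent NVML init with an atexit shutdown.
# Uses pynvml for illustration; the PR itself goes through cuda.bindings.
import atexit
import threading

from pynvml import nvmlInit, nvmlShutdown

_lock = threading.Lock()
_initialized = False


def ensure_nvml() -> None:
    """Initialize NVML once per process; later calls are no-ops."""
    global _initialized
    with _lock:
        if not _initialized:
            nvmlInit()
            # Only matters if an explicit shutdown turns out to be required.
            atexit.register(nvmlShutdown)
            _initialized = True
```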

copy-pr-bot (bot) commented Dec 17, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


```cython
from cuda.bindings cimport _nvml as nvml


def get_driver_version() -> tuple[int, int]:
```
@mdboom (Contributor Author) commented on this snippet:

This is a bit confusing: this is an existing API that returns the CUDA version. It really should be called get_cuda_version to avoid confusion, but that would be a breaking change. There is a new API below, get_gpu_driver_version, that returns the actual driver version, but that naming isn't great either.
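
For reference, NVML itself distinguishes the two concepts. A small pynvml-based illustration of the difference (not this PR's API, and the version values shown are only examples):

```python
# Illustration via pynvml (not cuda.core.system): NVML distinguishes the CUDA
# version supported by the installed driver from the display driver's own version.
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlSystemGetCudaDriverVersion,  # e.g. 13000 -> CUDA 13.0
    nvmlSystemGetDriverVersion,      # e.g. "580.65.06"
)

nvmlInit()
try:
    cuda_ver = nvmlSystemGetCudaDriverVersion()
    print("CUDA version:", (cuda_ver // 1000, (cuda_ver % 1000) // 10))
    print("GPU driver version:", nvmlSystemGetDriverVersion())
finally:
    nvmlShutdown()
```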

@mdboom (Contributor Author) commented Dec 17, 2025

/ok to test

A reviewer (Member) commented:

FYI, cuda.core supports any cuda-bindings/cuda-python 12.x and 13.x, many of which do not have the NVML bindings available. So we need a version guard here before importing anything that expects the bindings to exist, and to raise an exception in such cases.
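
Something along these lines, perhaps. The threshold, the error type, and the distribution-name lookup below are illustrative assumptions, not prescriptions:

```python
# Illustrative sketch of the suggested version guard; the minimum version and the
# error type are placeholders, not this PR's actual values.
from importlib.metadata import version

_MIN_BINDINGS_FOR_NVML = (13, 1)  # placeholder threshold


def _load_nvml_bindings():
    """Import the NVML bindings only if the installed cuda-bindings is new enough."""
    # Note: a real implementation should handle pre-release suffixes gracefully.
    installed = tuple(int(p) for p in version("cuda-bindings").split(".")[:2])
    if installed < _MIN_BINDINGS_FOR_NVML:
        raise RuntimeError(
            "cuda.core.system requires cuda-bindings >= "
            f"{'.'.join(map(str, _MIN_BINDINGS_FOR_NVML))}; found {version('cuda-bindings')}"
        )
    from cuda.bindings import nvml  # imported only once we know it exists
    return nvml
```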

@mdboom (Contributor Author) replied:

Ah, good reminder. I guess that precludes cimport'ing anything from cuda.bindings._nvml, since _nvml is a moving target. I'll just take that out for now...

@mdboom (Contributor Author) commented Dec 17, 2025

/ok to test

5 similar comments from @mdboom followed on Dec 17, 2025, each reading: /ok to test

Copilot AI (Contributor) left a comment:

Pull request overview

This PR introduces the new cuda.core.system module that provides system-level GPU information via NVML (NVIDIA Management Library). It replaces the previous singleton System class with a more comprehensive module that offers both backward-compatible functions and new NVML-powered device management capabilities.

Key changes:

  • Replaces singleton System class with module-level functions (get_num_devices(), get_driver_version(), etc.)
  • Adds comprehensive Device class with NVML-backed properties for device information (architecture, memory, PCI info, etc.)
  • Implements automatic NVML initialization on module import with version-gated availability
  • Provides utility functions for formatting byte counts and unpacking bitmasks (an illustrative sketch follows this list)
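
A rough sketch of what helpers like these typically look like; the function bodies below are illustrative assumptions, not the PR's actual utils.pyx code (the real unpack helper may operate on arrays of mask words rather than a single integer):

```python
# Hedged sketch of byte-formatting and bitmask-unpacking utilities.
def format_bytes(n: int) -> str:
    """Render a byte count with a binary-prefix unit, e.g. 1536 -> '1.5 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{int(value)} B" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024


def unpack_bitmask(mask: int) -> list[int]:
    """Return the indices of set bits, e.g. 0b1010 -> [1, 3]."""
    indices = []
    bit = 0
    while mask:
        if mask & 1:
            indices.append(bit)
        mask >>= 1
        bit += 1
    return indices
```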

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| cuda_core/tests/test_memory.py | Updates API calls from the ccx_system.num_devices property to the ccx_system.get_num_devices() function |
| cuda_core/tests/system/test_system_utils.py | Adds comprehensive tests for utility functions (format_bytes, unpack_bitmask) |
| cuda_core/tests/system/test_system_system.py | Adds tests for system-level functions (driver versions, device count, process name) |
| cuda_core/tests/system/test_system_device.py | Adds extensive tests for Device class properties (architecture, memory, PCI info, etc.) |
| cuda_core/tests/system/test_nvml_context.py | Adds tests for NVML initialization state management across processes |
| cuda_core/tests/system/conftest.py | Defines NVML version requirements and a skip marker for unsupported versions |
| cuda_core/tests/system/__init__.py | Empty __init__ file for the test module |
| cuda_core/cuda/core/experimental/system/utils.pyx | Implements utility functions for byte formatting and bitmask unpacking |
| cuda_core/cuda/core/experimental/system/system.pyx | Implements system-level query functions with NVML and fallback support |
| cuda_core/cuda/core/experimental/system/device.pyx | Implements Device class with comprehensive GPU properties via NVML |
| cuda_core/cuda/core/experimental/system/_nvml_context.pyx | Implements thread-safe, per-process NVML initialization logic |
| cuda_core/cuda/core/experimental/system/__init__.py | Module entry point with version-gated NVML imports and initialization |
| cuda_core/cuda/core/experimental/_system.py | Removes the deprecated singleton System class |
| cuda_core/cuda/core/experimental/__init__.py | Updates imports to use the new system module instead of the System singleton |
| cuda_bindings/cuda/bindings/_nvml.pyx | Adds enums and fixes BAR1Memory property naming (breaking change) |


@leofang added the P0 (High priority - Must do!), feature (New feature or request), and cuda.core (Everything related to the cuda.core module) labels on Dec 18, 2025
@leofang added this to the cuda.core backlog milestone on Dec 18, 2025
@leofang added the triage (Needs the team's attention) label on Dec 18, 2025
@mdboom marked this pull request as ready for review on December 18, 2025, 13:22
copy-pr-bot (bot) commented Dec 18, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@mdboom (Contributor Author) commented Dec 18, 2025

/ok to test
