Fix segmentation fault in distributed batched state evolution with store_intermediate_results #3771
Merged
1tnguyen merged 3 commits into NVIDIA:main on Jan 22, 2026
Conversation
…volution

This commit fixes a critical bug in the distributed batched state handling for cudaq.evolve() with store_intermediate_results=ALL on multi-GPU systems.

Root causes fixed:

1. distributeBatchedStateData: Incorrect batch index calculation caused out-of-bounds memory access when distributing state data across GPUs.
2. splitBatchedState: Used the local dimension with the global batch size, causing an incorrect state size calculation and the wrong number of states per GPU.
3. cudm_solver.py: Assumed splitBatchedState returns all batch_size states, but in distributed mode it correctly returns only the local subset.

Changes:

- Add a singleStateDimension field to CuDensityMatState to track the individual state dimension within a batch
- Fix the batch index calculation using the cuDensityMat API's batchModeLocation
- Update splitBatchedState to use singleStateDimension for correct sizing
- Update the Python solver to handle distributed partial results correctly
- Add comprehensive MPI tests for distributed batched evolution scenarios

Signed-off-by: huaweil <huaweil@nvidia.com>
Force-pushed from 13169b6 to 1c11fa1 (Compare)
schweitzpgi
reviewed
Jan 21, 2026
Collaborator
Good to go!
sacpis
reviewed
Jan 21, 2026
sacpis
approved these changes
Jan 21, 2026
Collaborator
sacpis
left a comment
LGTM. Thanks @huaweil-nv. Maybe @1tnguyen can take another look at it.
1tnguyen
approved these changes
Jan 22, 2026
Collaborator
1tnguyen
left a comment
LGTM 👍
I've run the newly added tests and confirmed the fix. Thanks @huaweil-nv!
CUDA Quantum Docs Bot: A preview of the documentation can be found here.
taalexander pushed a commit to taalexander/cuda-quantum that referenced this pull request on Jan 27, 2026
copy-pr-bot bot pushed 3 commits that referenced this pull request on Jan 28, 2026
taalexander pushed 3 commits to taalexander/cuda-quantum that referenced this pull request on Jan 30, 2026
taalexander added a commit that referenced this pull request on Jan 30, 2026
Description
This PR fixes a segmentation fault that occurs when running batched quantum state evolution with store_intermediate_results=IntermediateResultSave.ALL in a multi-GPU MPI environment.

Problem Description
When users run batched cudaq.evolve() with multiple GPUs and request intermediate results, the program crashes with a segmentation fault.

Error output:
Reproduction
Prerequisites
- The dynamics target must be available (cudaq.has_target('dynamics'))
- An MPI communicator plugin must be configured (CUDAQ_DYNAMICS_MPI_COMM_LIB)
All three conditions must be met simultaneously:
1. Running under mpirun -np N where N > 1
2. Passing a batch of initial states to cudaq.evolve() (e.g., [state1, state2, state3, state4])
3. Requesting store_intermediate_results=cudaq.IntermediateResultSave.ALL

Note: If any one condition is missing, the bug does NOT occur:
- store_intermediate_results=NONE or LAST_ONLY → works fine (doesn't call splitBatchedState)

Minimal Reproducer
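The reproducer script itself is not captured in this excerpt. A rough sketch of one, assuming the public CUDA-Q dynamics API (Schedule, IntermediateResultSave, cudaq.evolve) with illustrative operator and parameter choices that are not taken from this PR, might look like:

```python
# Hypothetical reproducer sketch; operator, schedule, and argument names are
# assumptions modeled on public CUDA-Q dynamics examples, not the exact script
# from this PR. Launch with: mpirun -np 2 python3 repro.py
try:
    import cudaq
    HAVE_DYNAMICS = cudaq.has_target("dynamics")
except ImportError:
    HAVE_DYNAMICS = False

def reproduce():
    import numpy as np
    import cupy as cp
    from cudaq import spin, Schedule, IntermediateResultSave

    cudaq.set_target("dynamics")
    cudaq.mpi.initialize()
    hamiltonian = 2 * np.pi * 0.1 * spin.x(0)
    schedule = Schedule(np.linspace(0, 10, 101), ["t"])
    psi = cudaq.State.from_data(cp.array([1.0, 0.0], dtype=cp.complex128))
    # The crashing combination: N > 1 ranks, a batch of initial states,
    # and ALL intermediate results requested.
    cudaq.evolve(hamiltonian, dimensions={0: 2}, schedule=schedule,
                 initial_state=[psi, psi, psi, psi],
                 observables=[spin.z(0)],
                 store_intermediate_results=IntermediateResultSave.ALL)
    cudaq.mpi.finalize()

# Call reproduce() under mpirun with the dynamics target installed; the
# import guard makes this sketch a no-op elsewhere.
```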
Run Command
Expected Behavior (Before Fix)
Expected Behavior (After Fix)
Root Cause Analysis
The bug is in the existing distributed batched state implementation. The code path for distributing batched states across GPUs exists (distributeBatchedStateData), but the reverse operation (splitBatchedState) was never updated to handle distributed states correctly.

Bug 1: CuDensityMatState::splitBatchedState (C++ layer)

Original buggy code (lines 785-793):
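The buggy snippet is elided here, but the size arithmetic it got wrong can be sketched with the assumed numbers from the analysis below (4 states of dimension 2 over 2 ranks):

```python
# Illustration only (numbers are the example values from this PR's analysis):
# each rank's local buffer holds 2 states of 2 amplitudes each.
local_dimension = 4         # batchedState.dimension (this rank's buffer size)
global_batch_size = 4       # batchedState.batchSize (states across all ranks)
single_state_dimension = 2  # the new singleStateDimension field

# Buggy: the local dimension was divided by the *global* batch size.
buggy_state_size = local_dimension // global_batch_size   # yields 1, wrong

# Fixed: each state is sized by singleStateDimension, and the number of
# states held locally follows from the local buffer size.
local_num_states = local_dimension // single_state_dimension  # yields 2
```

The fix therefore needs the per-state dimension tracked explicitly, since neither the local buffer size nor the global batch size alone determines it.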
Problem: In distributed mode:

- batchedState.dimension = local buffer size (e.g., 4 for 2 states on this rank)
- batchedState.batchSize = total batch size (e.g., 4 states across all ranks)
- stateSize = 4 / 4 = 1 ❌ (should be 2)

Bug 2: cudm_solver.py (Python layer)

Original buggy code:
Problem: In distributed mode, splitBatchedState only returns the local subset of states, not all batch_size states. Indexing with batch_size causes an IndexError.
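The corrected handling can be illustrated with a toy model; gather_local_results and the string states below are hypothetical stand-ins, not names from cudm_solver.py:

```python
# Toy model of the distributed indexing fix. `local_states` plays the role
# of splitBatchedState's return value: only this rank's subset of the batch.
def gather_local_results(local_states, batch_size, num_ranks, rank):
    states_per_rank = batch_size // num_ranks
    # Buggy pattern: `for i in range(batch_size): local_states[i]` raises
    # IndexError on every rank once num_ranks > 1, because only
    # states_per_rank states are present locally.
    assert len(local_states) == states_per_rank
    # Fixed pattern: iterate over what was actually returned and map each
    # local index to its global batch index for bookkeeping.
    return {rank * states_per_rank + i: s for i, s in enumerate(local_states)}

# Rank 1 of 2 holds the states at global indices 2 and 3 of a 4-state batch.
result = gather_local_results(["s2", "s3"], batch_size=4, num_ranks=2, rank=1)
# result == {2: "s2", 3: "s3"}
```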