Fix segmentation fault in distributed batched state evolution with store_intermediate_results #3771
Merged
1tnguyen merged 3 commits into NVIDIA:main on Jan 22, 2026
Conversation
…volution

This commit fixes a critical bug in the distributed batched state handling for cudaq.evolve() with store_intermediate_results=ALL on multi-GPU systems.

Root causes fixed:

1. distributeBatchedStateData: Incorrect batch index calculation caused out-of-bounds memory access when distributing state data across GPUs.
2. splitBatchedState: Used the local dimension with the global batch size, causing an incorrect state size calculation and the wrong number of states per GPU.
3. cudm_solver.py: Assumed splitBatchedState returns all batch_size states, but in distributed mode it correctly returns only the local subset.

Changes:

- Add a singleStateDimension field to CuDensityMatState to track the individual state dimension within a batch
- Fix the batch index calculation using the cuDensityMat API's batchModeLocation
- Update splitBatchedState to use singleStateDimension for correct sizing
- Update the Python solver to handle distributed partial results correctly
- Add comprehensive MPI tests for distributed batched evolution scenarios

Signed-off-by: huaweil <huaweil@nvidia.com>
Force-pushed from 13169b6 to 1c11fa1 (Compare)
schweitzpgi
reviewed
Jan 21, 2026
Collaborator
Good to go!
sacpis
reviewed
Jan 21, 2026
sacpis
approved these changes
Jan 21, 2026
Collaborator
sacpis
left a comment
LGTM. Thanks @huaweil-nv. Maybe @1tnguyen can take another look at it.
1tnguyen
approved these changes
Jan 22, 2026
Collaborator
1tnguyen
left a comment
LGTM 👍
I've run the newly added tests and confirmed the fix. Thanks @huaweil-nv!
CUDA Quantum Docs Bot: A preview of the documentation can be found here.
taalexander pushed a commit to taalexander/cuda-quantum that referenced this pull request on Jan 27, 2026
copy-pr-bot bot pushed 3 commits that referenced this pull request on Jan 28, 2026
taalexander pushed 3 commits to taalexander/cuda-quantum that referenced this pull request on Jan 30, 2026
taalexander added a commit that referenced this pull request on Jan 30, 2026
Description
This PR fixes a segmentation fault that occurs when running batched quantum state evolution with store_intermediate_results=IntermediateResultSave.ALL in a multi-GPU MPI environment.

Problem Description
When users run batched cudaq.evolve() with multiple GPUs and request intermediate results, the program crashes with a segmentation fault.

Error output:
Reproduction
Prerequisites
- The dynamics target must be available (cudaq.has_target('dynamics'))
- An MPI communicator plugin must be configured (CUDAQ_DYNAMICS_MPI_COMM_LIB)
All three conditions must be met simultaneously:
1. Running under mpirun -np N where N > 1
2. Passing a batch of initial states to cudaq.evolve() (e.g., [state1, state2, state3, state4])
3. Requesting store_intermediate_results=cudaq.IntermediateResultSave.ALL

Note: If any one condition is missing, the bug does NOT occur:
- store_intermediate_results=NONE or LAST_ONLY → works fine (doesn't call splitBatchedState)

Minimal Reproducer
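The reproducer script itself is not captured in this excerpt. A rough sketch of one, assuming the public CUDA-Q dynamics API (Schedule, IntermediateResultSave, cudaq.evolve) with illustrative operator and parameter choices that are not taken from this PR, might look like:

```python
# Hypothetical reproducer sketch; operator, schedule, and argument names are
# assumptions modeled on public CUDA-Q dynamics examples, not the exact script
# from this PR. Launch with: mpirun -np 2 python3 repro.py
try:
    import cudaq
    HAVE_DYNAMICS = cudaq.has_target("dynamics")
except ImportError:
    HAVE_DYNAMICS = False

def reproduce():
    import numpy as np
    import cupy as cp
    from cudaq import spin, Schedule, IntermediateResultSave

    cudaq.set_target("dynamics")
    cudaq.mpi.initialize()
    hamiltonian = 2 * np.pi * 0.1 * spin.x(0)
    schedule = Schedule(np.linspace(0, 10, 101), ["t"])
    psi = cudaq.State.from_data(cp.array([1.0, 0.0], dtype=cp.complex128))
    # The crashing combination: N > 1 ranks, a batch of initial states,
    # and ALL intermediate results requested.
    cudaq.evolve(hamiltonian, dimensions={0: 2}, schedule=schedule,
                 initial_state=[psi, psi, psi, psi],
                 observables=[spin.z(0)],
                 store_intermediate_results=IntermediateResultSave.ALL)
    cudaq.mpi.finalize()

# Call reproduce() under mpirun with the dynamics target installed; the
# import guard makes this sketch a no-op elsewhere.
```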
Run Command
Expected Behavior (Before Fix)
Expected Behavior (After Fix)
Root Cause Analysis
The bug is in the existing distributed batched state implementation. The code path for distributing batched states across GPUs exists (distributeBatchedStateData), but the reverse operation (splitBatchedState) was never updated to handle distributed states correctly.

Bug 1: CuDensityMatState::splitBatchedState (C++ layer)

Original buggy code (lines 785-793):
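The buggy snippet is elided here, but the size arithmetic it got wrong can be sketched with the assumed numbers from the analysis below (4 states of dimension 2 over 2 ranks):

```python
# Illustration only (numbers are the example values from this PR's analysis):
# each rank's local buffer holds 2 states of 2 amplitudes each.
local_dimension = 4         # batchedState.dimension (this rank's buffer size)
global_batch_size = 4       # batchedState.batchSize (states across all ranks)
single_state_dimension = 2  # the new singleStateDimension field

# Buggy: the local dimension was divided by the *global* batch size.
buggy_state_size = local_dimension // global_batch_size   # yields 1, wrong

# Fixed: each state is sized by singleStateDimension, and the number of
# states held locally follows from the local buffer size.
local_num_states = local_dimension // single_state_dimension  # yields 2
```

The fix therefore needs the per-state dimension tracked explicitly, since neither the local buffer size nor the global batch size alone determines it.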
Problem: In distributed mode:

- batchedState.dimension = local buffer size (e.g., 4 for 2 states on this rank)
- batchedState.batchSize = total batch size (e.g., 4 states across all ranks)
- stateSize = 4 / 4 = 1 ❌ (should be 2)

Bug 2: cudm_solver.py (Python layer)

Original buggy code:
Problem: In distributed mode, splitBatchedState only returns the local subset of states, not all batch_size states. Indexing with batch_size causes an IndexError.
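The corrected handling can be illustrated with a toy model; gather_local_results and the string states below are hypothetical stand-ins, not names from cudm_solver.py:

```python
# Toy model of the distributed indexing fix. `local_states` plays the role
# of splitBatchedState's return value: only this rank's subset of the batch.
def gather_local_results(local_states, batch_size, num_ranks, rank):
    states_per_rank = batch_size // num_ranks
    # Buggy pattern: `for i in range(batch_size): local_states[i]` raises
    # IndexError on every rank once num_ranks > 1, because only
    # states_per_rank states are present locally.
    assert len(local_states) == states_per_rank
    # Fixed pattern: iterate over what was actually returned and map each
    # local index to its global batch index for bookkeeping.
    return {rank * states_per_rank + i: s for i, s in enumerate(local_states)}

# Rank 1 of 2 holds the states at global indices 2 and 3 of a 4-state batch.
result = gather_local_results(["s2", "s3"], batch_size=4, num_ranks=2, rank=1)
# result == {2: "s2", 3: "s3"}
```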