Fix segmentation fault in distributed batched state evolution with store_intermediate_results #3771

Merged
1tnguyen merged 3 commits into NVIDIA:main from huaweil-nv:fix/distributed-batched-state-segfault
Jan 22, 2026

Conversation

@huaweil-nv
Collaborator

Description

This PR fixes a segmentation fault that occurs when running batched quantum state evolution with store_intermediate_results=IntermediateResultSave.ALL in a multi-GPU MPI environment.

Problem Description

When users run batched cudaq.evolve() with multiple GPUs and request intermediate results, the program crashes with a segmentation fault:

# This crashes on multi-GPU systems
results = cudaq.evolve(
    hamiltonian,
    dimensions,
    schedule,
    [state1, state2, state3, state4],  # Batched states
    store_intermediate_results=cudaq.IntermediateResultSave.ALL,  # Triggers the bug
    ...
)

Error output:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PROCESS ID 12345 RUNNING AT hostname
=   EXIT CODE: 11 (Segmentation fault)
===================================================================================

Reproduction

Prerequisites

Requirement          Description
Hardware             Multi-GPU system (2+ NVIDIA GPUs, e.g., 2x A100)
Software             CUDA-Q with the dynamics target (cudaq.has_target('dynamics'))
MPI                  MPI runtime with GPU support (e.g., OpenMPI)
cuDensityMat plugin  Installed (typically via CUDAQ_DYNAMICS_MPI_COMM_LIB)

Conditions to Trigger the Bug

All three conditions must be met simultaneously:

  1. Multi-GPU MPI execution: mpirun -np N where N > 1
  2. Batched initial states: Pass a list of states to cudaq.evolve() (e.g., [state1, state2, state3, state4])
  3. Store intermediate results: store_intermediate_results=cudaq.IntermediateResultSave.ALL

Note: If any one condition is missing, the bug does NOT occur:

  • Single GPU → works fine (no distribution needed)
  • Single initial state → works fine (no batching)
  • store_intermediate_results=NONE or LAST_ONLY → works fine (doesn't call splitBatchedState)

Minimal Reproducer

# reproducer.py
import cudaq
import cupy as cp
import numpy as np
from cudaq import spin, Schedule, RungeKuttaIntegrator

cudaq.mpi.initialize()
cudaq.set_target('dynamics')

# Simple single-qubit Hamiltonian
hamiltonian = 2 * np.pi * 0.1 * spin.x(0)
dimensions = {0: 2}
schedule = Schedule(np.linspace(0, 1, 11), ['time'])

# Create 4 distinct initial states (batched)
initial_states = []
for i in range(4):
    theta = i * np.pi / 8
    state_data = cp.array([np.cos(theta), np.sin(theta)], dtype=cp.complex128)
    initial_states.append(cudaq.State.from_data(state_data))

# This crashes before the fix (Segmentation fault or incorrect results)
results = cudaq.evolve(
    hamiltonian,
    dimensions,
    schedule,
    initial_states,  # batched states
    store_intermediate_results=cudaq.IntermediateResultSave.ALL,  # triggers splitBatchedState
    integrator=RungeKuttaIntegrator())

# Verify results (won't reach here before fix)
for i, result in enumerate(results):
    states = result.intermediate_states()
    print(f"Rank {cudaq.mpi.rank()}: State {i} has {len(states)} intermediate states")

cudaq.mpi.finalize()

Run Command

# Set MPI comm library if needed
export CUDAQ_DYNAMICS_MPI_COMM_LIB=/path/to/libcudaq_distributed_interface_mpi.so

# Run with 2 GPUs
mpirun -np 2 python reproducer.py

Expected Behavior (Before Fix)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 11 (Segmentation fault)
===================================================================================

Expected Behavior (After Fix)

Rank 0: State 0 has 11 intermediate states
Rank 0: State 1 has 11 intermediate states
Rank 1: State 2 has 11 intermediate states
Rank 1: State 3 has 11 intermediate states

Root Cause Analysis

The bug is in the existing distributed batched state implementation. The code path for distributing batched states across GPUs exists (distributeBatchedStateData), but the reverse operation (splitBatchedState) was never updated to handle distributed states correctly.

Bug 1: CuDensityMatState::splitBatchedState (C++ layer)

Original buggy code (line 785-793):

const int64_t stateSize = batchedState.dimension / batchedState.batchSize;
// ...
for (int i = 0; i < batchedState.batchSize; ++i) {
    // Read from ptr + i * stateSize
}

Problem: In distributed mode:

  • batchedState.dimension = local buffer size (e.g., 4 for 2 states on this rank)
  • batchedState.batchSize = total batch size (e.g., 4 states across all ranks)
  • stateSize = 4 / 4 = 1 ❌ (should be 2)
  • Loop tries to read 4 states from a buffer that only holds 2 → segmentation fault
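The arithmetic above can be sketched in plain Python. The numbers are hypothetical (4 states of dimension 2 split across 2 ranks), and `single_state_dim` stands in for the `singleStateDimension` field the fix introduces:

```python
# Illustration of the stateSize miscalculation in distributed mode.
# All numbers are hypothetical: 4 states of dimension 2 across 2 ranks.
single_state_dim = 2    # amplitudes per individual state
total_batch_size = 4    # states across ALL ranks (batchedState.batchSize)
num_ranks = 2

local_batch_size = total_batch_size // num_ranks        # 2 states on this rank
local_dimension = local_batch_size * single_state_dim   # 4 (batchedState.dimension)

# Buggy: divides the LOCAL buffer size by the GLOBAL batch size.
buggy_state_size = local_dimension // total_batch_size  # 1, so the loop walks
                                                        # past the local buffer

# Fixed: track the per-state dimension explicitly instead of deriving it.
fixed_state_size = single_state_dim                     # 2
```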

Bug 2: cudm_solver.py (Python layer)

Original buggy code:

split_states = bindings.splitBatchedState(state)
for i in range(batch_size):  # Assumes len(split_states) == batch_size
    intermediate_states[i].append(split_states[i])  # IndexError in distributed mode

Problem: In distributed mode, splitBatchedState only returns the local subset of states, not all batch_size states. Indexing with batch_size causes an IndexError.
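A corrected loop therefore has to iterate over the local subset and map each local index back to its global batch index. A minimal sketch, assuming a contiguous block distribution and using stand-in strings for the actual states (the `global_indices` helper is hypothetical, not a CUDA-Q API):

```python
def global_indices(rank: int, num_ranks: int, batch_size: int) -> list[int]:
    """Global batch indices owned by `rank` under a contiguous block split."""
    per_rank = batch_size // num_ranks
    start = rank * per_rank
    return list(range(start, start + per_rank))

batch_size, num_ranks = 4, 2
intermediate_states = [[] for _ in range(batch_size)]

for rank in range(num_ranks):
    # Stand-in for bindings.splitBatchedState(state): each rank only
    # produces its local subset of the batch, not all batch_size states.
    split_states = [f"state-{g}" for g in global_indices(rank, num_ranks, batch_size)]
    # Iterate over the local subset, not range(batch_size).
    for local_i, g in enumerate(global_indices(rank, num_ranks, batch_size)):
        intermediate_states[g].append(split_states[local_i])
```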

@huaweil-nv huaweil-nv requested review from 1tnguyen and sacpis January 21, 2026 13:40
@copy-pr-bot

copy-pr-bot bot commented Jan 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Fix segfault in splitBatchedState for distributed multi-GPU batched evolution

This commit fixes a critical bug in the distributed batched state handling
for cudaq.evolve() with store_intermediate_results=ALL on multi-GPU systems.

Root causes fixed:
1. distributeBatchedStateData: Incorrect batch index calculation caused
   out-of-bounds memory access when distributing state data across GPUs.
2. splitBatchedState: Used local dimension with global batch size, causing
   incorrect state size calculation and wrong number of states per GPU.
3. cudm_solver.py: Assumed splitBatchedState returns all batch_size states,
   but in distributed mode it correctly returns only local subset.

Changes:
- Add singleStateDimension field to CuDensityMatState to track individual
  state dimension within a batch
- Fix batch index calculation using cuDensityMat API's batchModeLocation
- Update splitBatchedState to use singleStateDimension for correct sizing
- Update Python solver to handle distributed partial results correctly
- Add comprehensive MPI tests for distributed batched evolution scenarios

Signed-off-by: huaweil <huaweil@nvidia.com>
@huaweil-nv huaweil-nv force-pushed the fix/distributed-batched-state-segfault branch from 13169b6 to 1c11fa1 Compare January 21, 2026 13:48
@schweitzpgi
Collaborator

schweitzpgi commented Jan 21, 2026

This looks like it should be merged into the python rewrite branch to me.

Good to go!

@schweitzpgi schweitzpgi added the python-lang Anything related to the Python CUDA Quantum language implementation label Jan 21, 2026
Collaborator

@sacpis sacpis left a comment


LGTM. Thanks @huaweil-nv. Maybe @1tnguyen can take another look at it.

@1tnguyen
Collaborator

1tnguyen commented Jan 21, 2026

/ok to test 1c11fa1

Command Bot: Processing...

Collaborator

@1tnguyen 1tnguyen left a comment


LGTM 👍

I've run the newly added tests and confirmed the fix. Thanks @huaweil-nv!

github-actions bot pushed a commit that referenced this pull request Jan 22, 2026
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

@1tnguyen
Collaborator

1tnguyen commented Jan 22, 2026

/ok to test 2cc2d70

Command Bot: Processing...

github-actions bot pushed a commit that referenced this pull request Jan 22, 2026
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

@1tnguyen 1tnguyen merged commit 1f83ff2 into NVIDIA:main Jan 22, 2026
193 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 22, 2026
taalexander pushed a commit to taalexander/cuda-quantum that referenced this pull request Jan 27, 2026
copy-pr-bot bot pushed a commit that referenced this pull request Jan 28, 2026
copy-pr-bot bot pushed a commit that referenced this pull request Jan 28, 2026
copy-pr-bot bot pushed a commit that referenced this pull request Jan 28, 2026
taalexander pushed a commit to taalexander/cuda-quantum that referenced this pull request Jan 30, 2026
taalexander pushed a commit to taalexander/cuda-quantum that referenced this pull request Jan 30, 2026
taalexander pushed a commit to taalexander/cuda-quantum that referenced this pull request Jan 30, 2026
taalexander added a commit that referenced this pull request Jan 30, 2026
* Update mgpu SHA to fix build and support DGX Spark (#3738)

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Isolate Logger.h from fmtlib headers (#3764)

* Isolate Logger.h from fmtlib headers

Signed-off-by: Renaud Kauffmann <rkauffmann@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix density matrix indexing bug in CuDensityMatState::operator() (#3768)

Bug: The operator() function used total dimension (dim*dim) instead of
single-side dimension (dim) for bounds checking and linear index
calculation when accessing density matrix elements.

For a 4x4 density matrix (dimension=16):
- Bug computed linear index as i * 16 + j (wrong)
- Correct is i * 4 + j

Impact:
- Valid indices like (1,1) would crash with CUDA memory error
- Invalid indices like (0,4) would silently pass bounds check

Fix: Use sqrt(dimension) to compute single-side dimension for both
bounds checking and linear index calculation.

Added regression tests in both Python and C++ test suites.

Signed-off-by: huaweil <huaweil@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
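The indexing fix described in that commit can be sketched in Python (`dm_element` is a hypothetical helper illustrating the logic, not the actual C++ `operator()`):

```python
import math

def dm_element(buffer, dimension, i, j):
    """Access element (i, j) of a flattened row-major density matrix.

    `dimension` is the total buffer length (dim * dim); the single-side
    dimension must be recovered with a square root before indexing.
    """
    dim = math.isqrt(dimension)          # 4 for a 4x4 matrix (dimension = 16)
    if not (0 <= i < dim and 0 <= j < dim):
        raise IndexError((i, j))         # (0, 4) is now correctly rejected
    return buffer[i * dim + j]           # i * dim + j, not i * dimension + j

rho = list(range(16))                    # 4x4 density matrix, dimension = 16
assert dm_element(rho, 16, 1, 1) == 5    # linear index 1 * 4 + 1
```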

* Removing fmtlib headers from all headers (#3770)

Removing fmtlib headers from all headers

Signed-off-by: Renaud Kauffmann <rkauffmann@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reducing cut-and-paste on the backend unit tests with cmake function (#3773)

* Reducing cut-and-paste on the backend unit tests with cmake function

Signed-off-by: Renaud Kauffmann <rkauffmann@nvidia.com>

* Removing debugging code

Signed-off-by: Renaud Kauffmann <rkauffmann@nvidia.com>

---------

Signed-off-by: Renaud Kauffmann <rkauffmann@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* [build] Optional sanitizers build flags in build script (#3772)

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* [monolithic change] Rework the way CUDA-Q embeds in Python (#3693)

[Python] Total rewrite of the python/CUDA-Q interface.

The current implementation of the Python handling of CUDA-Q has baked in
various attempts to deal with the language coupling between Python and
CUDA-Q kernels. These solutions have been accumulating and making it more
and more difficult to work on the Python implementation.

These changes are a total rewrite to bring the Python implementation more
closely aligned with the C++ implementation.

Changes:
  - The kernel builder and kernel decorator are fundamentally different
    and will no longer share a duck-typed interface. It doesn't work well.
    The builder assembles a CUDA-Q kernel dynamically. As such all symbolic
    references are known immediately. The decorator converts a static AST
    of code into a CUDA-Q kernel. Symbolic references are either local or
    not. Non-local symbols are unknown at the point the decorator is
    processed. All non-local symbols in a decorator are recorded with the
    decorator itself and lambda lifted as actual arguments.
  - MLIR requires that symbols be uniqued. The previous implementation ignored
    this requirement.
  - Lazy state maintenance in Python and the C++ runtime layers is buggy and
    not needed. It is removed. This includes dangling MLIR bindings from the
    AST bridge's python MLIR bindings.
  - Kernels are no longer built with assumptions, then rebuilt when those
    guesses prove wrong. Kernels are no longer built and rebuilt for different
    steps in the process. A kernel decorator builds a target agnostic, context
    independent kernel, and saves that MLIR ModuleOp under a unique name.
  - Launch scenarios have been reworked and consolidated to use the ModuleOp
    directly instead of shuffling between string representations (possibly
    under maps that were not thread-safe) and ModuleOp instances.
  - Every step of the process creating a brand new MLIRContext and loading all
    the dialects into that context, etc. is removed. This is done once and the
    Python interpreter uses the same context to build all modules.

Other changes include:

Fix GIL issue in get_state_async.

Restructure lambda lifting so it handles recursive ops.

Clone the ModuleOps as they go into the make_copyable_function closure
to prevent them from being erased along the way.

Remove VQE tests. Use VQE from CUDA-QX!

Simplifying cudaq::state support.

Handle kernel decorator from import module

Simplify the symbol table. Python is not a scoped language other than LEGB.

Convert kernel builder to generate code compatible with C++ for state initialization of veqs.

Refactor the AST bridge to generate state objects from the runtime.

Fixes for various tests.

and many other changes!

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
Co-authored-by: Thien Nguyen <thiennguyen@nvidia.com>
Co-authored-by: Bettina Heim <heimb@outlook.com>
Co-authored-by: Sachin Pisal <spisal@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix segfault in splitBatchedState for distributed multi-GPU batched evolution (#3771)

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* [Dynamics] Reduce data type conversion overhead in Torch integrator implementation  (#3779)

* Reduce overhead in data type conversion

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

* Fix test

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

* Fix target info for dynamics

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

---------

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* fix(python): add qpu_id parameter to observe function (#3739)

Signed-off-by: AndrewTKent <57512897+AndrewTKent@users.noreply.github.com>
Signed-off-by: Luca Mondada <luca@mondada.net>
Co-authored-by: Luca Mondada <luca@mondada.net>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Python <-> C interoperability (#3791)

* Python <-> C interoperability

This replaces the old interop interface with one that hides more of
the implementation details about the compiler and runtime libraries
from the user.

This is not quite as dynamic as native Python as kernel decorators
may be resolved in a python interpreter context *before* they are
used. This allows kernel decorator code to be pickled before use
by C++ kernel code, which in turn allows the C++ code to largely be
unaware that the kernel it is calling isn't just another C++ kernel.
The one caveat to this is that a qkernel object holding a pointer
to the entry point function of a kernel decorator must be annotated
so that the runtime layer can distinguish it and know it need not
try to find a host-side entry point in the C++ runtime dictionary.

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
Co-authored-by: Thien Nguyen <58006629+1tnguyen@users.noreply.github.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* - Fix Linker args issue with OSX.
- Fix Zlib/minizip issue with OSX and brew packages

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix issues in the finding of C++ headers in a cross-platform manner (including osx/linux) by injecting through CMAKE instead of doing the dynamic lookup in the cpp program.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Add OSX isel workarounds and documentation of existing workarounds.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Continue working around dylib issues by disabling static linking if dylib is being used which is currently required for OSX. The stack of hacks is quickly growing though and we might consider dropping dylibs and going back to static linking.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* - Use semicolon-separated CMAKE_INSTALL_RPATH on macOS
- Use @executable_path instead of $ORIGIN for macOS rpath
- Fix nvq++ mktemp template
- Use COMPILER_FLAGS for backendConfig.cpp compilation
- Replace .so with %cudaq_plugin_ext in plugin tests
- Replace |& with 2>&1 | for POSIX shell compatibility
- Add DISCOVERY_TIMEOUT 120 to backend unit tests

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Revert to LLVM idempotent registration patch approach as I was unable to find a way to fix the issue with static initializers.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Revert to static LLVM/MLIR linking with flat_namespace on macOS

- Update CMake configuration to use static LLVM/MLIR libraries instead of dylib.
- Add -Wl,-flat_namespace linker flag for macOS symbol resolution.
- Fix use-after-free bug in LoopUnrollPatterns.inc when allow-early-exit is enabled.
- Replace bash-only |& syntax with POSIX-compatible 2>&1 | in test files.
- Remove mlir_execution_engine_shared.diff patch (no longer needed).
- Update build_llvm.sh script for static library build.
- Fix force loading to be a bit more careful about where used with LLVMCodeGen

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Force loading of LLVMCodeGen to fix availability issues with fast-isel due to two-level namespacing in OSX.

Bugs uncovered on OSX:
- ConvertStmt.cpp: Fix use-after-free by saving operands before erasing call.
- CombineMeasurements.cpp: Return success() after erase/replace ops for rewriter/
- RegToMem.cpp: Return WalkResult::skip() after erasing ops during walk.
- ArgumentConversion.cpp: Fix lambda capture by value instead of reference.

Updating tests for OSX:
- Fix regex patterns for OSX ABI.
- A few floating point regexes
- Mark one of the KAK tests as unsupported on OSX as the values are different due to a different found decomposition. We could have used the OSX values but it seemed like overkill.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Address review comments. Remove unused additions and add new comments. Add new requirements-dev.txt. Add new library target for utilities to work around issues with link order.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Test fixes and linker fixes from previous changes with full test rerun.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix for braket tests.
- Make openssl build use cmake to avoid package config resolution issues with flat namespace static linking
- Force linking of cudaq common to use two-level namespace to prevent collisions/bugs with flat namespace and open ssl.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Updating year for the copyright header (#3710)

* Upgrading year for the copyright header

* updating year to 2026

* updating config

* restoring headers for contributors

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fixes required to build from fresh on MacOS ARM.

Some tests failing. Will fix in upcoming PRs.

Also migrated commits/building docs from macos wheel support PR for build system as these are better suited for this PR.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Restore build_wheel.sh script

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Remove orphaned files not in upstream/main

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix CMakeLists.txt: matrix.cpp is in operators/, not utils/

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset Pipelines.cpp to upstream (references deleted passes)

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Update include path: matrix.h moved from utils/ to operators/

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix: Apple ld doesn't support --start-group

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix: Apple ld doesn't support --unresolved-symbols

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix macOS RPATH: use semicolons and @loader_path

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
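The three Apple linker fixes above can be sketched in CMake. This is a minimal, hypothetical sketch — the target name `cudaq-common` is taken from the commit log, while `DEP_LIBS` and the exact RPATH entries are illustrative assumptions, not the actual build configuration:

```cmake
# Hypothetical sketch of the Apple-specific linker handling described above.
if(APPLE)
  # Apple's ld64 has no --start-group/--end-group; link the libraries directly.
  target_link_libraries(cudaq-common PRIVATE ${DEP_LIBS})
  # ld64 also lacks --unresolved-symbols; -undefined dynamic_lookup is the
  # closest equivalent for deferring symbol resolution to load time.
  target_link_options(cudaq-common PRIVATE -Wl,-undefined,dynamic_lookup)
  # CMake RPATH lists are semicolon-separated; macOS uses @loader_path
  # (not $ORIGIN) for paths relative to the loading binary.
  set_target_properties(cudaq-common PROPERTIES
    INSTALL_RPATH "@loader_path;@loader_path/../lib")
else()
  # GNU ld: group the static libraries to resolve circular dependencies.
  target_link_libraries(cudaq-common PRIVATE
    -Wl,--start-group ${DEP_LIBS} -Wl,--end-group)
endif()
```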

* Remove orphaned test_vqe.py (not in upstream/main)

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset test_assignments.py to upstream/main

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Restore macOS ARM64 JIT exception skip decorators for tests

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Restore macOS support in wheel scripts

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset test_resource_counter.py to upstream/main

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Restore macOS conditional CUDA dependencies in pyproject.toml

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Remove orphaned test_vqe_kernel.py (not in upstream/main)

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset test_unpack.py to upstream/main

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset test_kernel_shift_operators.py to upstream/main

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Add macOS ARM64 JIT exception skip to test_sample_in_choice

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Add macOS ARM64 JIT exception skip to test_unsupported_calls

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Revert "Add macOS ARM64 JIT exception skip to test_unsupported_calls"

This reverts commit 474e38c.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Revert copyright-only changes to match upstream/main

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Remove orphaned state_preparation tests and fix remaining copyrights

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset files to osx-cuda-quantum-support branch

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Delete orphaned files from python.redesign.0 branch

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset files to upstream/main

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Delete orphaned idempotent_option_registration.diff from python.redesign.0 branch

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset runtime/cudaq/CMakeLists.txt to osx-cuda-quantum-support

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Restore DISCOVERY_TIMEOUT 120 to gtest_discover_tests calls

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset test/AST-Quake/qalloc_initialization.cpp to osx-cuda-quantum-support

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Remove docs/OSX_BUILD.md

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Restore macOS-specific fixes from osx-cuda-quantum-support

- Add Python_SITELIB to PYTHONPATH for test_domains (fixes numpy not found)
- Add nvqir-qpp link to test_qudit for macOS
- Add UNSUPPORTED/XFAIL markers for darwin-arm64 on JIT exception tests

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
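The `test_domains` fix above can be sketched in CMake. A minimal sketch, assuming the tests are registered with CTest and that `Python_SITELIB` comes from CMake's `FindPython` module — the test name and environment layout are assumptions based on the commit message:

```cmake
# Hypothetical sketch: make numpy importable in test_domains by prepending
# the Python site-packages directory (Python_SITELIB from FindPython)
# to PYTHONPATH for that test only.
set_tests_properties(test_domains PROPERTIES
  ENVIRONMENT "PYTHONPATH=${Python_SITELIB}:$ENV{PYTHONPATH}")
```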

* Update qutip dependency to >5 to match upstream

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Clean up CMake comments: remove duplicate notes, add Apple linker comment

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset kernel_builder.cpp: remove unnecessary macOS guard

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Reset files to upstream/base: remove unnecessary macOS guards and whitespace changes

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Address review comments. Fix rpath and cu reference in build wheel.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix indentation.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Move the macOS skip to a mark.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Code formatting.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Fix rebase failure: undo unintended changes.

Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>

* Address PR #3717 review comments

- Format bash scripts with shfmt (4-space indentation)
- Add missing newline at end of find_wheel_assets.sh
- Remove unnecessary darwin markers from pyproject.toml.cu12
  (macOS only uses cu13, so cu12 doesn't need darwin handling)

---------

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thomas Alexander <thomasalexander2718@gmail.com>
Signed-off-by: Renaud Kauffmann <rkauffmann@nvidia.com>
Signed-off-by: huaweil <huaweil@nvidia.com>
Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
Signed-off-by: AndrewTKent <57512897+AndrewTKent@users.noreply.github.com>
Signed-off-by: Luca Mondada <luca@mondada.net>
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
Co-authored-by: Thien Nguyen <58006629+1tnguyen@users.noreply.github.com>
Co-authored-by: Renaud Kauffmann <rkauffmann@nvidia.com>
Co-authored-by: huaweil <93200147+huaweil-nv@users.noreply.github.com>
Co-authored-by: Luca Mondada <72734770+lmondada@users.noreply.github.com>
Co-authored-by: Eric Schweitz <eschweitz@nvidia.com>
Co-authored-by: Thien Nguyen <thiennguyen@nvidia.com>
Co-authored-by: Bettina Heim <heimb@outlook.com>
Co-authored-by: Sachin Pisal <spisal@nvidia.com>
Co-authored-by: AndrewTKent <57512897+AndrewTKent@users.noreply.github.com>
Co-authored-by: Luca Mondada <luca@mondada.net>
@bettinaheim added the bug fix label (to be listed under Bug Fixes in the release notes) on Mar 12, 2026
@bettinaheim added this to the release 0.14.0 milestone on Mar 12, 2026