[rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler by jrmadsen · Pull Request #4276 · ROCm/rocm-systems

jrmadsen · 2026-03-20T23:37:00Z

Motivation

Rewrites how rocprofiler-sdk handles the signal creation and signal async handlers in queue interception.

Technical Details

Creates an initial batch of 4096 signals and creates new batches of 4096 as needed.
Only assigns the async signal handler to the last packet in a batch of packets.
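
The second point can be illustrated with a minimal sketch. It does not use the HSA API: `packet`, `batch_session`, and `submit_batch` are hypothetical names, and the sketch assumes that the last packet's completion implies the whole batch is done.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for an intercepted dispatch packet.
struct packet
{
    int                   id = 0;
    std::function<void()> on_complete;  // stays empty unless a handler is attached
};

// Hypothetical per-batch state holding one entry per packet in the batch.
struct batch_session
{
    std::vector<int> packet_ids;
    std::size_t      handlers_registered = 0;
    std::size_t      packets_processed   = 0;
};

// Attach a completion handler only to the last packet. That single handler
// accounts for every packet recorded in the session.
inline void
submit_batch(std::vector<packet>& pkts, batch_session& session)
{
    if(pkts.empty()) return;

    for(const auto& p : pkts)
        session.packet_ids.push_back(p.id);

    pkts.back().on_complete = [&session]() {
        session.packets_processed += session.packet_ids.size();
    };
    ++session.handlers_registered;
}
```

With a batch of eight packets, only one handler is registered, and invoking it processes all eight entries.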

JIRA ID

Test Plan

Ideally, this just improves performance and any breakages will be detected in the existing tests.
Developing a test to prevent a performance regression will be difficult.

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR refactors rocprofiler-sdk’s HSA queue write interception and async signal handling to reduce per-dispatch overhead by batching per-packet state and introducing pooled/batched HSA signals.

Changes:

Introduces packet_data_t and updates completion callbacks to operate on per-packet data rather than session-wide fields.
Adds a pooled signal infrastructure (pool/pool_object) and rewires queue interception to allocate/reuse signals in batches.
Adds a new HIP test binary (hip-graph-bubbles) intended to create many graph-based kernel dispatches.
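
A rough sketch of what a batched pool with acquire/release semantics could look like is shown below. It is not the `pool` / `pool_object` code from this PR: `batched_pool` and its members are hypothetical, it is not thread-safe, and the batch size is a template parameter so the growth behavior is easy to see with small numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical batched pool: objects are allocated one batch at a time and
// handed out by index. Released objects are reused before a new batch is made.
template <typename Tp, std::size_t BatchSize>
class batched_pool
{
public:
    batched_pool() { add_batch(); }  // create the initial batch up front

    // Returns the index of an object that the caller now owns.
    std::size_t acquire()
    {
        if(m_available.empty()) add_batch();
        auto idx = m_available.back();
        m_available.pop_back();
        return idx;
    }

    // Returns an object to the pool so a later acquire() can reuse it.
    void release(std::size_t idx)
    {
        assert(idx < capacity());
        m_available.push_back(idx);
    }

    Tp& at(std::size_t idx) { return (*m_batches.at(idx / BatchSize))[idx % BatchSize]; }

    std::size_t capacity() const { return m_batches.size() * BatchSize; }

private:
    using batch_t = std::vector<Tp>;

    void add_batch()
    {
        const auto base = capacity();
        m_batches.push_back(std::make_unique<batch_t>(BatchSize));
        // Push in reverse so the lowest index of the new batch is handed out first.
        for(std::size_t i = BatchSize; i > 0; --i)
            m_available.push_back(base + i - 1);
    }

    std::vector<std::unique_ptr<batch_t>> m_batches;
    std::vector<std::size_t>              m_available;
};
```

Each batch lives behind its own `unique_ptr`, so references returned by `at()` stay valid when a later batch is added.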

Reviewed changes

Copilot reviewed 31 out of 34 changed files in this pull request and generated 9 comments.


File	Description
projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/hip-graph-bubbles.cpp	New test program that builds/launches a HIP graph repeatedly with roctx ranges.
projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/CMakeLists.txt	Build rules for the new `hip-graph-bubbles` test binary.
projects/rocprofiler-sdk/tests/bin/CMakeLists.txt	Adds `hip-graph-bubbles` subdirectory to the test build.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/tracing/fwd.hpp	Changes external correlation map type to a small_vector-backed container.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.hpp	Updates `post_kernel_call` signature to take `packet_data_t`.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.cpp	Threads `packet_data_t.user_data` through post-dispatch data iteration.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp	Updates session type name references for completion callback signatures.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp	Adapts to renamed session type and small_vector external correlation map.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.hpp	Updates dispatch tracing APIs to use `queue_info_session_t` + `packet_data_t`.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.cpp	Moves dispatch callback inputs from session-wide to per-packet storage.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/signal.hpp	Adds `signal_t` wrapper used by pooled signal objects.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_info_session.hpp	Introduces `packet_data_t` and refactors session to hold a small_vector of packet data.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp	Updates async completion callback signature and adds pooled-signal APIs.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp	Core refactor: batching packet data, pooled signals, and async handler changes.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/memory_allocation.cpp	Switches external correlation map alias to the new small_vector-backed type.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/async_copy.cpp	Switches external correlation map alias to the new small_vector-backed type.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/CMakeLists.txt	Adds `signal.hpp` to installed/compiled HSA headers list.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/tests/core.cpp	Updates tests for renamed session type and new completed_cb signature.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.hpp	Plumbs `packet_data_t` into callback processing params.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.cpp	Reads dispatch info/user_data/external corr IDs from `packet_data_t`.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.hpp	Updates completed callback signature to include `packet_data_t`.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.cpp	Passes `packet_data_t` through to sample processing.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/core.cpp	Updates controller callback wiring for new completed callback signature.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/code_object/code_object.cpp	Switches external correlation map alias to the new small_vector-backed type.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/buffer.cpp	Optimizes `get_buffer` lookup from linear scan to direct indexing.
projects/rocprofiler-sdk/source/lib/common/utility.hpp	Generalizes `get_val` to work with containers providing `find` (incl. small_vector pairs).
projects/rocprofiler-sdk/source/lib/common/mpl.hpp	Extends pair detection trait to expose `first_type` / `second_type`.
projects/rocprofiler-sdk/source/lib/common/container/static_vector.hpp	Simplifies emplace_back assignment path.
projects/rocprofiler-sdk/source/lib/common/container/stable_vector.hpp	Initializes members to defaults to avoid uninitialized state.
projects/rocprofiler-sdk/source/lib/common/container/small_vector.hpp	Adds map-like helpers for small_vector-of-pairs (`find`, `at`, `emplace`).
projects/rocprofiler-sdk/source/lib/common/container/record_header_buffer.cpp	Uses memset to clear only the used header range; adds `<cstring>`.
projects/rocprofiler-sdk/source/lib/common/container/pool_object.hpp	New pooled object wrapper with acquire/release semantics.
projects/rocprofiler-sdk/source/lib/common/container/pool.hpp	New pool implementation for batched reusable objects (used for signals).
projects/rocprofiler-sdk/source/lib/common/container/CMakeLists.txt	Adds new pool headers to the common container header list.


Copilot · 2026-03-20T23:43:00Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp

+    if(auto* pool = get_signal_pool(); use_pool && pool && attribute == 0)
+    {
+        auto& _signal = pool->acquire(construct_hsa_signal, 0, 0, nullptr, attribute);
+        ROCP_FATAL_IF(!_signal.in_use()) << "Acquired signal from pool that is not in use";
+        *signal = _signal.get().value;
+        // ROCP_INFO << fmt::format("acquired signal {} from pool: hsa_signal_t{{.handle={}}}",
+        //                          _signal.index(),
+        //                          _signal.get().value.handle);
+        get_core_table()->hsa_signal_store_screlease_fn(_signal.get().value, 1);
+        return &_signal;
+    }


The pooled-signal path calls pool->acquire(construct_hsa_signal, ...), which creates a brand new HSA signal each time the pooled object is acquired. That overwrites the previous hsa_signal_t handle stored in the pool object without destroying it, effectively leaking HSA signals and defeating reuse.

For pooled signals, create the HSA signal once when the pool batch is constructed and only reset its value on reuse (e.g., via hsa_signal_store_*), or explicitly destroy the previous handle before re-creating it.
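
The "create once, reset on reuse" suggestion can be sketched as follows. `mock_runtime` stands in for the HSA runtime and is purely hypothetical, as are `pooled_signal`, `acquire`, and `release`; none of these are real HSA calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the runtime that creates signal handles.
struct mock_runtime
{
    std::uint64_t next_handle = 1;
    std::size_t   created     = 0;

    std::uint64_t create()
    {
        ++created;
        return next_handle++;
    }
};

// Hypothetical pooled slot. A handle of 0 means "not yet created".
struct pooled_signal
{
    std::uint64_t handle = 0;
    std::int64_t  value  = 0;
    bool          in_use = false;
};

// Create the underlying handle only the first time the slot is used.
// Every acquire just resets the stored value.
inline pooled_signal&
acquire(mock_runtime& rt, pooled_signal& slot, std::int64_t initial_value)
{
    assert(!slot.in_use);
    if(slot.handle == 0) slot.handle = rt.create();
    slot.value  = initial_value;
    slot.in_use = true;
    return slot;
}

inline void
release(pooled_signal& slot)
{
    slot.in_use = false;
}
```

Acquiring, releasing, and re-acquiring the same slot yields the same handle and leaves the creation count at one, which is the reuse property the comment asks for.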

Copilot · 2026-03-20T23:43:00Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp

+    ROCP_INFO << fmt::format("WriteInterceptor called with pkt_count={}", pkt_count);
+
+    using callback_record_t = packet_data_t::callback_record_t;
+    using packet_vector_t   = common::container::small_vector<rocprofiler_packet, 512>;



ROCP_INFO logging inside WriteInterceptor will run on every intercepted queue write and can add significant overhead / log spam in performance-sensitive code paths.

Consider removing this log or demoting it to ROCP_TRACE / VLOG guarded by a debug flag.


powderluv · 2026-03-21T17:47:36Z

I pulled this PR into a clean local worktree and tried it against the same HIP graph kernel-trace repro cases we have been using for the queue/signal issue.

One build caveat first: on my ROCm 7.13 / TheRock venv, the PR head (9113c23e9d) does not build as-is because it is missing the separate fmt.hpp memory-copy-op compatibility fix for ROCm 7.13. I applied only that minimal compatibility patch locally, with no queue/signal behavior changes on top of the PR, so I could test the runtime behavior.

With that single compatibility patch added, I still could not get the PR branch to pass the HIP graph repro:

1000 x 300 with --kernel-trace segfaulted very early, before any CSV output was written.
256 x 200 with --kernel-trace also segfaulted before any CSV output was written.

I put the exact compatibility patch and the two crash logs into a secret gist here:

https://gist.github.com/powderluv/f65f4560fe338effd090fd7dd57d833d

Files in the gist:

README.md
pr4276_rocm713_compat.patch
pr4276_k1000_i300_run.log
pr4276_k256_i200_run.log

So at least on this setup, this alternative implementation is not yet passing the existing HIP graph test cases.

powderluv · 2026-03-21T20:35:23Z

I pulled this into a clean workspace and iterated on top of the PR head locally. The updated branch is here:

Local commit stack on top of the PR branch:

d33b45eda3 rocprofiler-sdk: handle ROCm 7.13 memory copy op layouts
d88004b100 rocprofiler-sdk: avoid host-thread state on async queue callbacks
5e6cd16418 rocprofiler-sdk: prearm queue completion callbacks for hip graphs

What changed at a high level:

stopped using host-thread-only state on ROCr async-doorbell callbacks
skipped tool-side kernel rename / HIP stream external-correlation setup when there is no host correlation id
switched the queue completion path to fresh one-shot pre-armed slots
kept pre-armed handlers alive until a real queue session is attached
changed the queue completion async-handler condition from EQ -1 to LT 1, which was the turning point for actually draining dispatch completions on this HIP graph case
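
The comment does not explain why the condition change helped. One possible reading is sketched below under explicit assumptions: a completion decrements the signal by one, and some signals start at 1 (as in the pooled path quoted earlier, which stores 1) while others may start at 0. The `condition` enum and helper names are illustrative, not the HSA enum.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative mirror of signal wait conditions (hypothetical names).
enum class condition
{
    eq,
    ne,
    lt,
    gte
};

constexpr bool
satisfied(condition cond, std::int64_t signal_value, std::int64_t compare_value)
{
    switch(cond)
    {
        case condition::eq: return signal_value == compare_value;
        case condition::ne: return signal_value != compare_value;
        case condition::lt: return signal_value < compare_value;
        case condition::gte: return signal_value >= compare_value;
    }
    return false;
}

// Assumption: one completion decrements the signal value by one.
constexpr std::int64_t
after_completion(std::int64_t initial_value)
{
    return initial_value - 1;
}
```

Under these assumptions, a signal that starts at 1 ends at 0, which never satisfies `EQ -1` but does satisfy `LT 1`. A signal that starts at 0 ends at -1 and satisfies both.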

Validation on the HIP graph reproducer (--kernel-trace, CSV output):

256 x 20: passes, 5120 rows / 5120 unique dispatch ids
256 x 200: passes, 51200 rows / 51200 unique dispatch ids
1000 x 200: passes, 200000 rows / 200000 unique dispatch ids
2000 x 200: passes, 400000 rows / 400000 unique dispatch ids
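
The pass criterion used here (row count equals the product of the two shape numbers, and every dispatch id is unique) can be expressed as a small check. This is a sketch, not the script used for the runs above. `check_trace` is a hypothetical helper, and it assumes the shape "K x N" means K kernels per graph launched N times.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct trace_check
{
    std::size_t rows       = 0;
    std::size_t unique_ids = 0;
    bool        ok         = false;
};

// Expect kernels * iterations rows, with every dispatch id appearing exactly once.
inline trace_check
check_trace(const std::vector<std::uint64_t>& dispatch_ids,
            std::size_t                       kernels,
            std::size_t                       iterations)
{
    std::unordered_set<std::uint64_t> seen(dispatch_ids.begin(), dispatch_ids.end());

    trace_check result;
    result.rows       = dispatch_ids.size();
    result.unique_ids = seen.size();
    result.ok = (result.rows == kernels * iterations) && (result.unique_ids == result.rows);
    return result;
}
```

For the 256 x 20 case, 5120 distinct ids pass the check, and a single duplicated id makes it fail.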

This is materially different from the original state I tested earlier on this machine, where the branch either failed to build on the ROCm 7.13 venv or segfaulted / failed to emit profiler output on the same HIP graph kernel-trace cases.

powderluv · 2026-03-22T06:49:37Z

I added a local hotspot pass on the current pr4276-based branch using the HIP graph reproducer with queue-signal timing enabled.

Method:

staged local rocprofv3 from the current pr4276 workspace
ROCPROFILER_QUEUE_SIGNAL_TRACE=1
ROCPROFILER_QUEUE_SIGNAL_TRACE_PERIOD=65536
compared the first ~65536 traced-dispatch summary on two shapes:
- 2000 x 300
- 3000 x 200

The main result is that the async completion callback is not the dominant performance hotspot.

At the first summary window:

2000 x 300
- dispatch_setup_avg_us=32.306
- completion_avg_us=1.265
- create_avg_us=0.641
- register_avg_us=1.262
- enqueue_latency_avg_us=4152.152
- direct_create_calls=24406

3000 x 200
- dispatch_setup_avg_us=33.673
- completion_avg_us=1.384
- create_avg_us=0.696
- register_avg_us=1.202
- enqueue_latency_avg_us=4457.345
- direct_create_calls=23946

Interpretation:

enqueue-side WriteInterceptor work is roughly 24x-26x larger than the async completion callback work
hsa_amd_signal_create and hsa_amd_signal_async_handler are visible, but neither is the dominant cost by itself
callback subphases are small:
- get_dispatch_avg_us ~ 0.116-0.125
- dispatch_complete_avg_us ~ 0.367-0.406
- callback_avg_us ~ 0.070-0.072
the queue is still accumulating noticeable completion lag (enqueue_latency_avg_us ~ 4.1-4.5 ms), but the direct callback body is not expensive enough to explain the overall slowdown
the prearmed slot path still falls back to direct creates frequently (~24k misses in the first ~65k dispatches), so slot availability is still part of the picture
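
The ratios quoted above can be re-derived from the reported counters. The sketch below only redoes that arithmetic. The window size of 65536 dispatches is an approximation taken from the "~65536" wording, and the struct and function names are hypothetical.

```cpp
#include <cassert>

struct window_summary
{
    double dispatch_setup_avg_us;
    double completion_avg_us;
    double direct_create_calls;
    double traced_dispatches;  // approximate window size
};

// How many times larger enqueue-side setup is than the completion callback.
constexpr double
setup_to_completion_ratio(const window_summary& w)
{
    return w.dispatch_setup_avg_us / w.completion_avg_us;
}

// Fraction of dispatches that fell back to a direct signal create.
constexpr double
direct_create_fraction(const window_summary& w)
{
    return w.direct_create_calls / w.traced_dispatches;
}

constexpr window_summary shape_2000x300{32.306, 1.265, 24406, 65536};
constexpr window_summary shape_3000x200{33.673, 1.384, 23946, 65536};
```

The ratios come out at roughly 25.5 and 24.3, consistent with the "24x-26x" statement, and the fallback fractions are roughly 37% for both shapes.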

The next useful step is finer instrumentation inside WriteInterceptor itself, especially around:

correlation / external-correlation work
tracing enter/exit callback overhead
queue callback fanout on enqueue
packet transformation / serialization path
slot-acquire miss path versus ready-slot hit path

So the current evidence says: optimize enqueue-side setup first, not async callback execution.

powderluv · 2026-03-22T07:01:40Z

Follow-up hotspot note from a second local instrumentation pass on the HIP graph repro.

I split the enqueue-side dispatch_setup_avg_us bucket into non-overlapping pieces on the current local pr4276 worktree and sampled the first ~65536 traced dispatches of two shapes:

2000 x 300
- dispatch_setup_avg_us=45.849
- dispatch_packet_avg_us=0.180
- dispatch_signal_avg_us=44.611
  - dispatch_signal_create_avg_us=44.392
  - dispatch_signal_arm_avg_us=0.219
- completion_avg_us=0.849
- enqueue_latency_avg_us=5814.085
- direct_create_calls=8834 / 65532

3000 x 200
- dispatch_setup_avg_us=54.235
- dispatch_packet_avg_us=0.488
- dispatch_signal_avg_us=52.278
  - dispatch_signal_create_avg_us=52.044
  - dispatch_signal_arm_avg_us=0.233
- completion_avg_us=1.618
- enqueue_latency_avg_us=6893.844
- direct_create_calls=1064 / 65521

Takeaway:

The main enqueue-side hotspot is the completion-signal acquisition / creation stage in WriteInterceptor, not packet building and not async-handler arm/register.
Packet build is sub-0.5 us here.
Arm/register is only about 0.22-0.23 us.
Completion callback work is still small (<2 us).
The wider graph shape (3000x200) is slower mainly because the signal-create/acquire stage grows, and enqueue latency grows with it.

One nuance: the raw create_avg_us counter for hsa_amd_signal_create itself is still sub-1 us, so this larger dispatch_signal_create_avg_us bucket is measuring the broader completion-signal acquisition path, not just the raw runtime call in isolation. That points more toward ready-slot acquisition / fallback / surrounding queue bookkeeping than the async callback path.
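
Splitting a phase into non-overlapping buckets, as described above, is commonly done with scoped timers. The sketch below shows one way to do it. It is not the instrumentation used for these measurements, and `bucket`, `scoped_timer`, `dispatch_buckets`, and `timed_dispatch` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Accumulates elapsed time and call count for one phase.
struct bucket
{
    std::uint64_t total_ns = 0;
    std::uint64_t count    = 0;

    double avg_us() const { return count == 0 ? 0.0 : (total_ns / 1000.0) / count; }
};

// RAII timer: adds its own lifetime to a bucket when it goes out of scope.
class scoped_timer
{
    using clock_type = std::chrono::steady_clock;

public:
    explicit scoped_timer(bucket& b)
    : m_bucket{b}
    , m_start{clock_type::now()}
    {}

    ~scoped_timer()
    {
        auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
            clock_type::now() - m_start);
        m_bucket.total_ns += static_cast<std::uint64_t>(elapsed.count());
        ++m_bucket.count;
    }

    scoped_timer(const scoped_timer&) = delete;
    scoped_timer& operator=(const scoped_timer&) = delete;

private:
    bucket&                m_bucket;
    clock_type::time_point m_start;
};

struct dispatch_buckets
{
    bucket packet;
    bucket signal_create;
    bucket signal_arm;
};

// Each phase runs in its own scope, so the buckets do not overlap.
template <typename BuildFn, typename CreateFn, typename ArmFn>
void
timed_dispatch(dispatch_buckets& b, BuildFn&& build, CreateFn&& create, ArmFn&& arm)
{
    { scoped_timer t{b.packet};        build();  }
    { scoped_timer t{b.signal_create}; create(); }
    { scoped_timer t{b.signal_arm};    arm();    }
}
```

Because each timer covers exactly one call, the per-phase averages add up to at most the enclosing total, and whatever remains is the untimed bookkeeping between phases.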

powderluv · 2026-03-22T08:37:34Z

Follow-up after cleaning up the local diff and updating the comparison branch.

I pushed a cleaned queue-only commit on top of users/powderluv/pr4276-hip-graph-fix:

b0db72c610 rocprofiler-sdk: use a ready queue for prearmed signals

What changed in this cleanup:

kept only the ready-queue optimization for prearmed completion slots
dropped the temporary hotspot instrumentation
kept the header-side async_signal_* type placement needed for a clean rebuild in this branch layout
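
The general idea of a ready queue with a direct-create fallback can be sketched like this. It is not the code in commit b0db72c610: `slot` and `ready_queue` are hypothetical, the sketch is single-threaded, and the ids handed out on the fallback path are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical completion slot.
struct slot
{
    std::uint64_t id       = 0;
    bool          prearmed = false;
};

class ready_queue
{
public:
    // Make a pre-armed slot available to the enqueue path.
    void push_ready(std::uint64_t id) { m_ready.push_back(slot{id, true}); }

    // Fast path: pop a ready slot in constant time.
    // Slow path: fall back to a direct create and count the miss.
    slot acquire()
    {
        if(!m_ready.empty())
        {
            slot s = m_ready.front();
            m_ready.pop_front();
            ++m_hits;
            return s;
        }
        ++m_direct_creates;
        return slot{m_next_direct_id++, false};
    }

    std::size_t hits() const { return m_hits; }
    std::size_t direct_creates() const { return m_direct_creates; }

private:
    std::deque<slot> m_ready;
    std::size_t      m_hits           = 0;
    std::size_t      m_direct_creates = 0;
    std::uint64_t    m_next_direct_id = 1000;  // arbitrary starting id for fallback slots
};
```

Counting hits and direct creates separately makes the miss rate visible, which is the quantity the earlier `direct_create_calls` counter tracked.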

Validated from a clean rebuild/stage in the venv-backed environment at:

/data/anush/github/bubble/SWDEV-583475/stage/rocprofiler-sdk-pr4276-push

Wide HIP graph kernel-trace reruns on the cleaned stage:

3000 x 200: passed on rerun, full CSV written
- log: /data/anush/github/bubble/SWDEV-583475/logs/hip-graph-cleanpush-k3000-i200-rerun-20260322T080944Z/run.log
- csv: /data/anush/github/bubble/SWDEV-583475/profiles/hip-graph-cleanpush-k3000-i200-rerun-20260322T080944Z/rocprofv3/trace_kernel_trace.csv
- result: 600000 rows / 600000 unique Dispatch_Id
2000 x 300: passed, full CSV written
- log: /data/anush/github/bubble/SWDEV-583475/logs/hip-graph-cleanpush-k2000-i300-20260322T080957Z/run.log
- csv: /data/anush/github/bubble/SWDEV-583475/profiles/hip-graph-cleanpush-k2000-i300-20260322T080957Z/rocprofv3/trace_kernel_trace.csv
- result: 600000 rows / 600000 unique Dispatch_Id

One caveat: the first fresh 3000 x 200 run after the clean rebuild hit a one-off hip::stream::get_stream_id segfault:

/data/anush/github/bubble/SWDEV-583475/logs/hip-graph-cleanpush-k3000-i200-20260322T080840Z/run.log

That fault did not reproduce on the immediate rerun above, and the second wide case also passed. So the ready-queue throughput fix is on the branch now, but there is still some residual instability outside the queue ready-queue path that may need a separate follow-up.

bwelton · 2026-03-23T18:23:12Z

Creates an initial batch of 4096 signals and creates new batches of 4096 as needed.

If it ever needs to exceed 4096, you may run into this exact issue again. There is a limit to the number of signals that can be created before polling must be used for all of them (I believe that limit is 4096).

Is this specifically only with kernel-trace? Do we have experiments that show that this change is enough to resolve the underlying problem?

bwelton · 2026-03-23T18:48:26Z

Given the discussion in https://amd-hub.atlassian.net/browse/ROCM-20396 as well, we should consider just doing the out-of-band solution for getting the profiling time for these kernels. It doesn't make much sense to hack on both sides here to get around an issue that could be resolved by just supporting out-of-band performance metrics collection. I suspect both of these independent solutions will be fragile in that they will either see performance degradation under different circumstances or experience bugs (the latter relates more to the HSA changes in the PR for ROCM-20396).


bwelton · 2026-03-31T15:35:42Z

projects/rocprofiler-sdk/source/lib/common/container/record_header_buffer.cpp

+        // indicate the number of used elements.
+        if(_n > 0)
+        {
+            std::memset(m_headers.data(), 0, _n * sizeof(rocprofiler_record_header_t));


Why is this here? I am not opposed to the change but why this PR?

This got pulled in from @itrowbri's optimizations in another PR. He found some performance improvements, so they are included as part of the overall performance improvement of the write interceptor.

bwelton · 2026-03-31T15:36:46Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/buffer.cpp

+        // Use direct indexing instead of linear search (same pattern as destroy_buffer)
+        // See allocate_buffer below that the idx is assigned based on the size + address
+        auto  idx = buffer_id.handle - get_buffer_offset();
+        auto& buf = get_buffers()->at(idx);


Similar question to record_header_buffer.cpp, why is this specifically included in this PR?
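
For context, the linear-scan versus direct-indexing difference in the diff above can be sketched as follows. `buffer_registry` is hypothetical and assumes handles are assigned as a fixed offset plus the buffer's index, which is a simplification of whatever `allocate_buffer` actually does. The diff uses `at(idx)`, while the sketch returns `nullptr` for out-of-range handles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct buffer
{
    std::uint64_t handle = 0;
};

class buffer_registry
{
public:
    explicit buffer_registry(std::uint64_t offset)
    : m_offset{offset}
    {}

    // Handles are assigned as offset + index, so lookup can invert that mapping.
    std::uint64_t allocate()
    {
        auto handle = m_offset + m_buffers.size();
        m_buffers.push_back(std::make_unique<buffer>(buffer{handle}));
        return handle;
    }

    // Constant-time lookup; returns nullptr for handles outside the allocated range.
    buffer* get(std::uint64_t handle) const
    {
        if(handle < m_offset) return nullptr;
        auto idx = handle - m_offset;
        return idx < m_buffers.size() ? m_buffers[idx].get() : nullptr;
    }

    // Linear scan, kept only for comparison with get().
    buffer* get_linear(std::uint64_t handle) const
    {
        for(const auto& b : m_buffers)
            if(b->handle == handle) return b.get();
        return nullptr;
    }

private:
    std::uint64_t                        m_offset;
    std::vector<std::unique_ptr<buffer>> m_buffers;
};
```

Both lookups return the same buffer, but `get()` does not depend on how many buffers exist.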

bwelton · 2026-03-31T15:37:30Z

projects/rocprofiler-sdk/source/lib/common/container/static_vector.hpp

-            m_data[_idx] = {std::forward<Args>(_v)...};
-        else
-            m_data[_idx] = Tp{std::forward<Args>(_v)...};
+        m_data[_idx] = Tp{std::forward<Args>(_v)...};


Are there issues with constexpr above?

It had to do with is_assignable being malformed when sizeof...(Args) > 1
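
As background for this reply: `std::is_assignable` takes exactly two type arguments, so expanding a parameter pack into it only compiles when the pack has one element. The exact trait expression in the removed code is not shown in the diff. The sketch below shows only the simplified path, in a hypothetical `fixed_vector` that always constructs `Tp` explicitly before assigning.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical fixed-capacity vector with the simplified emplace_back path.
template <typename Tp, std::size_t N>
class fixed_vector
{
public:
    template <typename... Args>
    Tp& emplace_back(Args&&... args)
    {
        assert(m_size < N);
        // Construct explicitly, then assign. This is valid for any number of
        // arguments and needs no trait check on the parameter pack.
        m_data[m_size] = Tp{std::forward<Args>(args)...};
        return m_data[m_size++];
    }

    std::size_t size() const { return m_size; }

private:
    std::array<Tp, N> m_data = {};
    std::size_t       m_size = 0;
};
```

The same call works with two constructor arguments or with a single already-built value.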


bwelton · 2026-03-31T15:45:40Z

projects/rocprofiler-sdk/source/lib/common/container/small_vector.hpp

@@ -297,6 +298,8 @@ class small_vector_template_common : public small_vector_base<small_vector_size_
    using value_type      = T;
    using iterator        = T*;
    using const_iterator  = const T*;
+    using key_type        = typename mpl::is_pair<T>::first_type;   // will be void if not pair
+    using mapped_type     = typename mpl::is_pair<T>::second_type;  // will be void if not pair


This is no longer a small vector but a flat map with this change. The change that needs this (external_correlation_id_map_t from map -> this flat map) doesn't seem like it's necessary in this PR and is really a separate optimization.
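
To make the "flat map" point concrete, here is a minimal sketch of map-like helpers over a contiguous sequence of pairs. It uses `std::vector` as a stand-in for `small_vector`, and `flat_map` is a hypothetical name, not the PR's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <utility>
#include <vector>

template <typename Key, typename Mapped>
class flat_map
{
public:
    using value_type = std::pair<Key, Mapped>;
    using iterator   = typename std::vector<value_type>::iterator;

    // Linear search, which is reasonable when the container holds few entries.
    iterator find(const Key& key)
    {
        return std::find_if(m_data.begin(), m_data.end(), [&key](const value_type& v) {
            return v.first == key;
        });
    }

    iterator end() { return m_data.end(); }

    // Inserts only if the key is absent, mirroring std::map::emplace.
    std::pair<iterator, bool> emplace(Key key, Mapped value)
    {
        auto itr = find(key);
        if(itr != m_data.end()) return {itr, false};
        m_data.emplace_back(std::move(key), std::move(value));
        return {std::prev(m_data.end()), true};
    }

    // Throws if the key is missing, mirroring std::map::at.
    Mapped& at(const Key& key)
    {
        auto itr = find(key);
        if(itr == m_data.end()) throw std::out_of_range{"flat_map::at: key not found"};
        return itr->second;
    }

    std::size_t size() const { return m_data.size(); }

private:
    std::vector<value_type> m_data;
};
```

Keeping these helpers in a separate wrapper type, rather than on the vector itself, is one way to address the concern that the container's role changes.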

bwelton · 2026-03-31T15:48:57Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp

+    }
+
+    hsa_status_t status =
+        get_amd_ext_table()->hsa_amd_signal_create_fn(1, 0, nullptr, attribute, signal);


We should do a general check here for get_amd_ext_table vs _ext_api usage at some point. These seem to be non-uniformly used in this PR.


bwelton · 2026-03-31T16:08:43Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp

+        {
+            ROCP_TRACE << fmt::format("Destroying interrupt signal {{.handle={}}}",
+                                      packet.interrupt_signal.handle);
+            hsa::get_core_table()->hsa_signal_destroy_fn(packet.interrupt_signal);


Would it be simpler here to just use the pools for all signals? This change carries somewhat higher risk, so we may want to do it separately.


bwelton · 2026-03-31T16:16:59Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp

@@ -213,14 +311,17 @@ WriteInterceptor(const void* packets,
        return;
    }



There is a really nasty edge case when serialization and batching are both enabled. If they are, it looks like there would be a deadlock here, since batching could cause kernel_completion_signal() not to be triggered when we expect it to be.

I would suggest we gate batching so that it is not used when serialization is enabled (the performance bubbles don't matter in serialized cases anyway).
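
The suggested gate amounts to a single predicate. The sketch below is illustrative only: `interceptor_config`, `use_batching`, and `handlers_for` are hypothetical and do not correspond to names in the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configuration flags for the interceptor.
struct interceptor_config
{
    bool batching_requested;
    bool serialization_enabled;
};

// Batching is only used when serialization is off.
constexpr bool
use_batching(const interceptor_config& cfg)
{
    return cfg.batching_requested && !cfg.serialization_enabled;
}

// Number of completion handlers registered for a write of pkt_count packets:
// one for the whole batch when batching, otherwise one per packet.
constexpr std::size_t
handlers_for(const interceptor_config& cfg, std::size_t pkt_count)
{
    if(pkt_count == 0) return 0;
    return use_batching(cfg) ? 1 : pkt_count;
}
```

With serialization enabled, every packet keeps its own handler, so each kernel's completion is still observed individually.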

ammarwa

PR Review: [rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler

Reviewed the signal pooling, batched packet processing, and refactored queue_info_session.

Found 2 critical, 2 important, 2 suggestions, and 1 nit.

🤖 Generated with Claude Code


ammarwa · 2026-03-31T16:44:36Z

projects/rocprofiler-sdk/source/lib/common/container/pool.hpp

+        }
+    }
+
+    return acquire();


💡 SUGGESTION: acquire() uses unbounded recursion for retry

After creating a new batch (lines 147-161), this calls return acquire(); recursively. Under extreme contention (many threads exhausting the pool simultaneously), the new batch could be consumed by other threads before this recursive call runs, leading to repeated batch creation and unbounded stack growth.

With 4096-element batches this is extremely unlikely in practice, but a while(true) loop would be strictly safer and equally readable:

    while(true)
    {
        // ... try to acquire from m_available ...
        if(_idx.has_value()) { /* return */ }
        // ... create new batch if needed ...
    }

ammarwa · 2026-03-31T16:44:36Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/signal.hpp

+// pair of hsa signal and user data pointer for async handler
+struct signal_t
+{
+    // bool         handler_is_set = false;


💡 SUGGESTION: Remove commented-out members

These commented-out members (handler_is_set, data) appear to be leftover development code. They add noise and may confuse future readers about whether they should be re-enabled.

    struct signal_t
    {
        hsa_signal_t value = {.handle = 0};
    };

ammarwa · 2026-03-31T16:44:36Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp

        // Copy kernel pkt, copy is to allow for signal to be modified
-        rocprofiler_packet kernel_pkt = packets_arr[i];
+        _packet_data.kernel_packet = packets_arr[i];
+        // create a referencce for short hand access


NIT: Typo

referencce → reference

jrmadsen · 2026-04-01T01:10:17Z

There is a larger problem with the entire system... The background thread for processing counters (which appears to have been implemented by @ApoKalipse-V) is asynchronously operating on signals after they've been released back into the pool. It will take some time to resolve this.

jrmadsen requested review from a team as code owners March 20, 2026 23:37

Copilot AI review requested due to automatic review settings March 20, 2026 23:37

jrmadsen changed the title ~~Users/jrmadsen/optimize hsa write interceptor~~ [rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler Mar 20, 2026

github-actions bot added the project: rocprofiler-sdk label Mar 20, 2026

Copilot started reviewing on behalf of jrmadsen March 20, 2026 23:38

systems-assistant bot added the organization: ROCm label Mar 20, 2026

Copilot AI reviewed Mar 20, 2026

sadikarmagan assigned bwelton and ammarwa Mar 23, 2026

ammarwa reviewed Mar 24, 2026

projects/rocprofiler-sdk/source/lib/common/container/pool.hpp (outdated)

ammarwa reviewed Mar 24, 2026

projects/rocprofiler-sdk/source/lib/common/container/pool.hpp

ammarwa reviewed Mar 24, 2026

projects/rocprofiler-sdk/source/lib/common/container/pool.hpp

ammarwa reviewed Mar 24, 2026

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp (outdated)

ammarwa reviewed Mar 24, 2026

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp

ammarwa reviewed Mar 24, 2026

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp (outdated)

ammarwa reviewed Mar 24, 2026

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp (outdated)

jrmadsen and others added 6 commits March 31, 2026 10:40

Add reproducer for hipGraphLaunch GPU activity bubbles

37ad36a

Changed use of unordered map to small vector for optimization purposes

b92eee9

Pool implementation for hsa signals

f030415

Optimize usage of AsyncSignalHandler

fc35b4f

Formatting fixes

c30b1d5

Addressed various review comments

8eba8be

- fixed some thread safety concerns - fixed potential leaking of signals

Fix resource deadlock when destroying signal pool

3f0a2bd

jrmadsen force-pushed the users/jrmadsen/optimize-hsa-write-interceptor branch from 63341bc to 3f0a2bd March 31, 2026 15:40

bwelton reviewed Mar 31, 2026

ammarwa reviewed Mar 31, 2026

Fix miscellaneous bugs in implementation

d53af9d
