[rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler#4276
[rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler#4276
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors rocprofiler-sdk’s HSA queue write interception and async signal handling to reduce per-dispatch overhead by batching per-packet state and introducing pooled/batched HSA signals.
Changes:
- Introduces
packet_data_tand updates completion callbacks to operate on per-packet data rather than session-wide fields. - Adds a pooled signal infrastructure (pool/pool_object) and rewires queue interception to allocate/reuse signals in batches.
- Adds a new HIP test binary (
hip-graph-bubbles) intended to create many graph-based kernel dispatches.
Reviewed changes
Copilot reviewed 31 out of 34 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/hip-graph-bubbles.cpp | New test program that builds/launches a HIP graph repeatedly with roctx ranges. |
| projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/CMakeLists.txt | Build rules for the new hip-graph-bubbles test binary. |
| projects/rocprofiler-sdk/tests/bin/CMakeLists.txt | Adds hip-graph-bubbles subdirectory to the test build. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/tracing/fwd.hpp | Changes external correlation map type to a small_vector-backed container. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.hpp | Updates post_kernel_call signature to take packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.cpp | Threads packet_data_t.user_data through post-dispatch data iteration. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp | Updates session type name references for completion callback signatures. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp | Adapts to renamed session type and small_vector external correlation map. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.hpp | Updates dispatch tracing APIs to use queue_info_session_t + packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.cpp | Moves dispatch callback inputs from session-wide to per-packet storage. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/signal.hpp | Adds signal_t wrapper used by pooled signal objects. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_info_session.hpp | Introduces packet_data_t and refactors session to hold a small_vector of packet data. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp | Updates async completion callback signature and adds pooled-signal APIs. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp | Core refactor: batching packet data, pooled signals, and async handler changes. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/memory_allocation.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/async_copy.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/CMakeLists.txt | Adds signal.hpp to installed/compiled HSA headers list. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/tests/core.cpp | Updates tests for renamed session type and new completed_cb signature. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.hpp | Plumbs packet_data_t into callback processing params. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.cpp | Reads dispatch info/user_data/external corr IDs from packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.hpp | Updates completed callback signature to include packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.cpp | Passes packet_data_t through to sample processing. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/core.cpp | Updates controller callback wiring for new completed callback signature. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/code_object/code_object.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/buffer.cpp | Optimizes get_buffer lookup from linear scan to direct indexing. |
| projects/rocprofiler-sdk/source/lib/common/utility.hpp | Generalizes get_val to work with containers providing find (incl. small_vector pairs). |
| projects/rocprofiler-sdk/source/lib/common/mpl.hpp | Extends pair detection trait to expose first_type / second_type. |
| projects/rocprofiler-sdk/source/lib/common/container/static_vector.hpp | Simplifies emplace_back assignment path. |
| projects/rocprofiler-sdk/source/lib/common/container/stable_vector.hpp | Initializes members to defaults to avoid uninitialized state. |
| projects/rocprofiler-sdk/source/lib/common/container/small_vector.hpp | Adds map-like helpers for small_vector-of-pairs (find, at, emplace). |
| projects/rocprofiler-sdk/source/lib/common/container/record_header_buffer.cpp | Uses memset to clear only the used header range; adds <cstring>. |
| projects/rocprofiler-sdk/source/lib/common/container/pool_object.hpp | New pooled object wrapper with acquire/release semantics. |
| projects/rocprofiler-sdk/source/lib/common/container/pool.hpp | New pool implementation for batched reusable objects (used for signals). |
| projects/rocprofiler-sdk/source/lib/common/container/CMakeLists.txt | Adds new pool headers to the common container header list. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if(auto* pool = get_signal_pool(); use_pool && pool && attribute == 0) | ||
| { | ||
| auto& _signal = pool->acquire(construct_hsa_signal, 0, 0, nullptr, attribute); | ||
| ROCP_FATAL_IF(!_signal.in_use()) << "Acquired signal from pool that is not in use"; | ||
| *signal = _signal.get().value; | ||
| // ROCP_INFO << fmt::format("acquired signal {} from pool: hsa_signal_t{{.handle={}}}", | ||
| // _signal.index(), | ||
| // _signal.get().value.handle); | ||
| get_core_table()->hsa_signal_store_screlease_fn(_signal.get().value, 1); | ||
| return &_signal; | ||
| } |
There was a problem hiding this comment.
The pooled-signal path calls pool->acquire(construct_hsa_signal, ...), which creates a brand new HSA signal each time the pooled object is acquired. That overwrites the previous hsa_signal_t handle stored in the pool object without destroying it, effectively leaking HSA signals and defeating reuse.
For pooled signals, create the HSA signal once when the pool batch is constructed and only reset its value on reuse (e.g., via hsa_signal_store_*), or explicitly destroy the previous handle before re-creating it.
| ROCP_INFO << fmt::format("WriteInterceptor called with pkt_count={}", pkt_count); | ||
|
|
||
| using callback_record_t = packet_data_t::callback_record_t; | ||
| using packet_vector_t = common::container::small_vector<rocprofiler_packet, 512>; | ||
|
|
There was a problem hiding this comment.
ROCP_INFO logging inside WriteInterceptor will run on every intercepted queue write and can add significant overhead / log spam in performance-sensitive code paths.
Consider removing this log or demoting it to ROCP_TRACE / VLOG guarded by a debug flag.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
|
I pulled this PR into a clean local worktree and tried it against the same HIP graph kernel-trace repro cases we have been using for the queue/signal issue. One build caveat first: on my ROCm 7.13 / TheRock venv, the PR head ( With that single compatibility patch added, I still could not get the PR branch to pass the HIP graph repro:
I put the exact compatibility patch and the two crash logs into a secret gist here: https://gist.github.com/powderluv/f65f4560fe338effd090fd7dd57d833d Files in the gist:
So at least on this setup, this alternative implementation is not yet passing the existing HIP graph test cases. |
|
I pulled this into a clean workspace and iterated on top of the PR head locally. The updated branch is here:
Local commit stack on top of the PR branch:
What changed at a high level:
Validation on the HIP graph reproducer (
This is materially different from the original state I tested earlier on this machine, where the branch either failed to build on the ROCm 7.13 venv or segfaulted / failed to emit profiler output on the same HIP graph kernel-trace cases. |
|
I added a local hotspot pass on the current Method:
The main result is that the async completion callback is not the dominant performance hotspot. At the first summary window:
Interpretation:
The next useful step is finer instrumentation inside
So the current evidence says: optimize enqueue-side setup first, not async callback execution. |
|
Follow-up hotspot note from a second local instrumentation pass on the HIP graph repro. I split the enqueue-side
Takeaway:
One nuance: the raw |
|
Follow-up after cleaning up the local diff and updating the comparison branch. I pushed a cleaned queue-only commit on top of
What changed in this cleanup:
Validated from a clean rebuild/stage in the venv-backed environment at:
Wide HIP graph kernel-trace reruns on the cleaned stage:
One caveat: the first fresh
That fault did not reproduce on the immediate rerun above, and the second wide case also passed. So the ready-queue throughput fix is on the branch now, but there is still some residual instability outside the queue ready-queue path that may need a separate follow-up. |
If it ever needs to exceed 4096, you may run into this exact issue again. There is a limit to the number of signals that can be created before polling must be used for all of them (i believe that limit is 4096). Is this specifically only with kernel-trace? Do we have experiments that show that this change is enough to resolve the underlying problem? |
|
Given the discussion in https://amd-hub.atlassian.net/browse/ROCM-20396 as well. We should consider just doing the out of band solution for getting the profiling time for these kernels. It doesn't make much sense to hack on both sides here to get around an issue that could be resolved by just supporting out of band performance metrics collection. I suspect both of these independent solutions will be fragile in that they will either see performance degradation under different circumstances or experience bugs (which is more in relation to the HSA changes in the PR for ROCM_20396). |
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
- fixed some thread safety concerns - fixed potential leaking of signals
63341bc to
3f0a2bd
Compare
| // indicate the number of used elements. | ||
| if(_n > 0) | ||
| { | ||
| std::memset(m_headers.data(), 0, _n * sizeof(rocprofiler_record_header_t)); |
There was a problem hiding this comment.
Why is this here? I am not opposed to the change but why this PR?
There was a problem hiding this comment.
This got pulled in from @itrowbri's optimizations in another PR. He found some performance improvements so they are included as part of the overall performance improvement of the write interceptor
| // Use direct indexing instead of linear search (same pattern as destroy_buffer) | ||
| // See allocate_buffer below that the idx is assigned based on the size + address | ||
| auto idx = buffer_id.handle - get_buffer_offset(); | ||
| auto& buf = get_buffers()->at(idx); |
There was a problem hiding this comment.
Similar question to record_header_buffer.cpp, why is this specifically included in this PR?
| m_data[_idx] = {std::forward<Args>(_v)...}; | ||
| else | ||
| m_data[_idx] = Tp{std::forward<Args>(_v)...}; | ||
| m_data[_idx] = Tp{std::forward<Args>(_v)...}; |
There was a problem hiding this comment.
Are there issues with constexpr above?
There was a problem hiding this comment.
It had to do with is_assignable being malformed when sizeof...(Args) > 1
| @@ -297,6 +298,8 @@ class small_vector_template_common : public small_vector_base<small_vector_size_ | |||
| using value_type = T; | |||
| using iterator = T*; | |||
| using const_iterator = const T*; | |||
| using key_type = typename mpl::is_pair<T>::first_type; // will be void if not pair | |||
| using mapped_type = typename mpl::is_pair<T>::second_type; // will be void if not pair | |||
There was a problem hiding this comment.
this is no longer a small vector but a flat map with this change. The change that needs this (external_correlation_id_map_t from map -> this flat map) doesn't seem like its necessary in this PR and is really a separate optimization.
| } | ||
|
|
||
| hsa_status_t status = | ||
| get_amd_ext_table()->hsa_amd_signal_create_fn(1, 0, nullptr, attribute, signal); |
There was a problem hiding this comment.
We should do a general check here for get_amd_ext_table vs _ext_api usage at some point. These seem to be non-uniformly used in this PR.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
| { | ||
| ROCP_TRACE << fmt::format("Destroying interrupt signal {{.handle={}}}", | ||
| packet.interrupt_signal.handle); | ||
| hsa::get_core_table()->hsa_signal_destroy_fn(packet.interrupt_signal); |
There was a problem hiding this comment.
Would it be simpler here to maybe just used the pools for all signals? There is some higher risk of this change so we may want to do this separately.
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
| @@ -213,14 +311,17 @@ WriteInterceptor(const void* packets, | |||
| return; | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
There is a really nasty edge case when serialization and batching are both enabled. If they are both enabled, there seems like there would be a deadlock here since the batching would potentially cause kernel_completion_signal() to not be triggered when we expect it to be.
I would suggest we actually gate batching to not be used when serialization is enabled (the performance bubble's don't matter in serialized cases anyway).
ammarwa
left a comment
There was a problem hiding this comment.
PR Review: [rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler
Reviewed the signal pooling, batched packet processing, and refactored queue_info_session.
Found 2 critical, 2 important, 2 suggestions, and 1 nit.
🤖 Generated with Claude Code
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
|
|
||
| return acquire(); |
There was a problem hiding this comment.
💡 SUGGESTION: acquire() uses unbounded recursion for retry
After creating a new batch (lines 147-161), this calls return acquire(); recursively. Under extreme contention (many threads exhausting the pool simultaneously), the new batch could be consumed by other threads before this recursive call runs, leading to repeated batch creation and unbounded stack growth.
With 4096-element batches this is extremely unlikely in practice, but a while(true) loop would be strictly safer and equally readable:
while(true)
{
// ... try to acquire from m_available ...
if(_idx.has_value()) { /* return */ }
// ... create new batch if needed ...
}| // pair of hsa signal and user data pointer for async handler | ||
| struct signal_t | ||
| { | ||
| // bool handler_is_set = false; |
There was a problem hiding this comment.
💡 SUGGESTION: Remove commented-out members
These commented-out members (handler_is_set, data) appear to be leftover development code. They add noise and may confuse future readers about whether they should be re-enabled.
struct signal_t
{
hsa_signal_t value = {.handle = 0};
};| // Copy kernel pkt, copy is to allow for signal to be modified | ||
| rocprofiler_packet kernel_pkt = packets_arr[i]; | ||
| _packet_data.kernel_packet = packets_arr[i]; | ||
| // create a referencce for short hand access |
There was a problem hiding this comment.
NIT: Typo
referencce → reference
|
There is a larger problem with the entire system... The background thread for processing counters (which appears to have been implemented by @ApoKalipse-V) is asynchronously operating on signals after they've been released back into the pool. It will take some time to resolve this. |
Motivation
Rewrites how rocprofiler-sdk handles the signal creation and signal async handlers in queue interception.
Technical Details
JIRA ID
Test Plan
Test Result
Submission Checklist