[rocprofiler-sdk] - unable to collect PMC data #4590

Open
ihhethan wants to merge 1 commit into develop from users/ihhuang/ROCM-1214

@ihhethan ihhethan commented Mar 31, 2026

Problem

When running rocprofv3 -A absolute --pmc SQ_WAVES -- roccap play <trace>, PMC counter data could not be collected, and the command could cause an SSH disconnect and node destabilization.

Motivation

Enable rocprofv3 --pmc to work correctly with roccap play (AQL trace replay)
without causing node destabilization or data loss.

Technical Details

See ticket ROCM-1214

JIRA ID

Resolves ROCM-1214

Test Plan

Run the following command on a server, using rocplaycap AQL trace replay:

rocprofv3 -A absolute --pmc SQ_WAVES --output-format csv \
  -d <output_dir> -- roccap play <trace.cap>

Verify:

  1. PMC counter data is correctly collected and written to CSV
  2. Node remains stable (no SSH disconnect, no GPU reset)
  3. dmesg is clean
  4. Process exits naturally without requiring Ctrl+C
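
The first and third checks above could be partly automated. The following is a minimal sketch, not part of this PR: the helper names `find_counter_rows` and `suspicious_dmesg_lines` are hypothetical, the CSV column layout is deliberately not assumed (any data row containing a cell equal to the counter name counts as a hit), and the dmesg keywords are supplied by the caller because the exact kernel messages of interest are not specified here.

```python
import csv
from pathlib import Path


def find_counter_rows(output_dir, counter_name="SQ_WAVES"):
    """Return {csv_path: matching_row_count} for CSV files under output_dir.

    Hypothetical helper: it makes no assumption about rocprofv3's column
    names and only looks for data rows with a cell equal to counter_name.
    """
    hits = {}
    for csv_path in sorted(Path(output_dir).rglob("*.csv")):
        with csv_path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        # rows[0] is treated as the header; only data rows are counted.
        matching = sum(1 for row in rows[1:] if counter_name in row)
        if matching:
            hits[str(csv_path)] = matching
    return hits


def suspicious_dmesg_lines(dmesg_text, keywords):
    """Return the dmesg lines that contain any keyword (case-insensitive).

    The keywords are caller-supplied assumptions, for example "gpu reset".
    """
    lowered = [keyword.lower() for keyword in keywords]
    return [
        line
        for line in dmesg_text.splitlines()
        if any(keyword in line.lower() for keyword in lowered)
    ]
```

After the rocprofv3 run finishes, `find_counter_rows("<output_dir>")` should return a non-empty mapping if SQ_WAVES rows were written, and `suspicious_dmesg_lines(text, ["gpu reset"])` should return an empty list for a clean log, where `text` is the captured dmesg output. Node stability and natural process exit still need to be confirmed by hand.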

Test Result

ROCm 7.1 — VALIDATED

ROCm 7.2 — PENDING

@ihhethan ihhethan requested review from a team as code owners March 31, 2026 16:31
@ihhethan ihhethan changed the title fix(rocprofiler-sdk): ROCM-1214 unable to collect PMC data with rocpr… rocprofiler-sdk : ROCM-1214 unable to collect PMC data with rocpr… Mar 31, 2026
@ihhethan ihhethan changed the title rocprofiler-sdk : ROCM-1214 unable to collect PMC data with rocpr… rocprofiler-sdk : ROCM-1214 unable to collect PMC data with rocprofv3 --pmc + roccap play on MI300X Mar 31, 2026
@ihhethan ihhethan force-pushed the users/ihhuang/ROCM-1214 branch from a63a2c9 to 13d83ba Compare March 31, 2026 16:35
@ihhethan ihhethan changed the title rocprofiler-sdk : ROCM-1214 unable to collect PMC data with rocprofv3 --pmc + roccap play on MI300X [rocprofiler-sdk] - unable to collect PMC data Mar 31, 2026
@ihhethan ihhethan force-pushed the users/ihhuang/ROCM-1214 branch 3 times, most recently from 3196fa9 to e2155c1 Compare March 31, 2026 17:06
Fix the failure to collect PMC data when running rocprofv3 --pmc with roccap play.
The issue caused SSH disconnects and node destabilization due to duplicate
/dev/kfd opens in rocplaycap child processes, HSA runtime teardown race
conditions, and a signal handler deadlock.

Fix profiler initialization, teardown race conditions, and signal handler
issues to ensure stable PMC data collection with roccap play.

Note: companion fixes for rocplaycap will be submitted separately.
@ihhethan ihhethan force-pushed the users/ihhuang/ROCM-1214 branch from e2155c1 to 03875f1 Compare March 31, 2026 21:31