2 changes: 1 addition & 1 deletion projects/rocprofiler-sdk/source/bin/rocprofv3.py
@@ -750,7 +750,7 @@ def add_parser_bool_argument(gparser, *args, **kwargs):
add_parser_bool_argument(
filter_options,
"--selected-regions",
help="If set, rocprofv3 will only profile regions of code surrounded by roctxProfilerResume(0) and roctxProfilerPause(0)",
help="If set, rocprofv3 will only profile regions of code surrounded by roctxProfilerResume(0) and roctxProfilerPause(0).",
)
add_parser_bool_argument(
filter_options,
Binary file modified projects/rocprofiler-sdk/source/docs/data/perfetto_rccl.png
@@ -27,6 +27,16 @@ Markers

Markers annotate a single point in the code with a message. Creating markers helps you see when a line of code is executed.

When using ROCTx markers with profiling tools that support Perfetto trace visualization (such as ``rocprofv3`` or ``rocprof-sys``), markers created with ``roctxMark()`` appear as arrows in the Perfetto UI timeline. For example, markers like ``roctxMark("pre-launch")`` and ``roctxMark("post-kernel-launch")`` will be displayed as visual markers that point to the exact timestamp when that line of code was executed.

To illustrate this, a call to ``roctxMark("iter")`` was added after the kernel launch in the timing loop of the `daxpy training example code <https://github.com/amd/HPCTrainingExamples/blob/main/HIP-Optimizations/daxpy/daxpy_5.hip>`_. In the Perfetto trace generated by running ``rocprofv3 --runtime-trace`` and ``rocpd2pftrace``, we see visual markers at the end of each kernel launch in the timing loop ROCTx region.

.. figure:: /data/perfetto_marker.png
    :alt: ROCTx markers displayed as arrows in Perfetto UI
    :align: center

    Example of ``roctxMark()`` annotations appearing as arrows in the Perfetto UI timeline
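
The marker placement described above can be sketched as follows. This is a minimal illustration, not the daxpy example itself: ``run_timing_loop`` and ``launch_kernel_placeholder`` are hypothetical names, and the kernel launch is replaced by an empty function. The sketch assumes a C++17 compiler; when the ROCTx header is not installed, a no-op stub replaces ``roctxMark()`` so the code still builds, and when the header is present the program must be linked against the ROCTx library.

```cpp
#include <cstddef>

#if __has_include(<rocprofiler-sdk-roctx/roctx.h>)
#    include <rocprofiler-sdk-roctx/roctx.h>
#else
// Fallback stub (assumption for illustration): lets the sketch build
// without ROCm installed. It records nothing.
inline void roctxMark(const char*) {}
#endif

// Hypothetical stand-in for the kernel launch in the timing loop.
inline void launch_kernel_placeholder() {}

// Emits one "iter" marker after each launch and returns the number of
// markers emitted, which equals the number of iterations.
std::size_t run_timing_loop(std::size_t iterations)
{
    std::size_t marks = 0;
    for(std::size_t i = 0; i < iterations; ++i)
    {
        launch_kernel_placeholder();
        roctxMark("iter");  // shows up as one arrow per iteration in Perfetto
        ++marks;
    }
    return marks;
}
```

Each call to ``roctxMark("iter")`` corresponds to one arrow in the timeline, so a loop of N iterations should produce N markers.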

Ranges
=======

@@ -161,6 +171,162 @@ The preceding command generates a ``hip_api_trace.csv`` file prefixed with the p
"HIP_RUNTIME_API","hipFree",1643920,1643920,15,320301643320908,320301643511479
"HIP_RUNTIME_API","hipFree",1643920,1643920,16,320301643512629,320301643585639

Profiler control with selected regions
+++++++++++++++++++++++++++++++++++++++

The ``roctxProfilerPause()`` and ``roctxProfilerResume()`` APIs can be used in two different ways depending on whether the ``--selected-regions`` option is used with ``rocprofv3``.

Two modes of operation
=======================

**Mode 1: Default behavior (without --selected-regions)**

When running ``rocprofv3`` without the ``--selected-regions`` option, profiling starts **enabled** from the beginning of the application. The ``roctxProfilerPause()`` and ``roctxProfilerResume()`` APIs are used to temporarily hide specific sections of code from profiling.

- Use case: "Profile everything except these specific regions"
- Profiler starts: **Enabled**
- ``roctxProfilerPause()``: Temporarily stops data collection
- ``roctxProfilerResume()``: Resumes data collection
- The example in the previous section demonstrates this mode

**Mode 2: Selected regions profiling (with --selected-regions)**

When running ``rocprofv3`` with the ``--selected-regions`` option, profiling starts **disabled** by default. Data collection only occurs within regions explicitly enclosed by ``roctxProfilerResume()`` and ``roctxProfilerPause()`` calls.

- Use case: "Profile only these specific regions, ignore everything else"
- Profiler starts: **Disabled**
- ``roctxProfilerResume()``: Starts data collection
- ``roctxProfilerPause()``: Stops data collection
- All tracing and profiling options collect data **only** within the marked regions
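
The difference between the two modes can be sketched with a toy model. The code below does not call ROCTx or ``rocprofv3``; ``CollectionModel`` and ``replay`` are hypothetical names, and the ``selected_regions`` flag only stands in for passing ``--selected-regions`` on the command line. It replays the same instrumented sequence under both modes and reports which activities would be recorded.

```cpp
#include <string>
#include <vector>

// Toy model of the profiler's collection state. It does not call ROCTx.
// `selected_regions` stands in for passing --selected-regions to rocprofv3.
class CollectionModel
{
public:
    explicit CollectionModel(bool selected_regions)
    : collecting_(!selected_regions)  // enabled by default, disabled with the option
    {}

    void resume() { collecting_ = true; }   // models roctxProfilerResume(0)
    void pause() { collecting_ = false; }   // models roctxProfilerPause(0)

    // Models an API call or kernel launch that the profiler might record.
    void activity(const std::string& name)
    {
        if(collecting_) recorded_.push_back(name);
    }

    const std::vector<std::string>& recorded() const { return recorded_; }

private:
    bool                     collecting_;
    std::vector<std::string> recorded_;
};

// Replays the same instrumented sequence under either mode.
std::vector<std::string> replay(bool selected_regions)
{
    CollectionModel model(selected_regions);
    model.activity("setup");
    model.resume();
    model.activity("region");
    model.pause();
    model.activity("teardown");
    return model.recorded();
}
```

In this model, ``replay(false)`` records ``setup`` and ``region`` (everything up to the pause), while ``replay(true)`` records only ``region``.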

Using --selected-regions option
================================

The ``--selected-regions`` option enables profiling only within specific code regions that you explicitly mark. This provides fine-grained control over data collection, allowing you to focus profiling on specific regions of interest in your application.

**Important:** When ``--selected-regions`` is enabled, **all** requested tracing or profiling data (kernel traces, API traces, memory copy traces, counter collection, etc.) will be collected **only** within the regions enclosed by ``roctxProfilerResume()`` and ``roctxProfilerPause()`` calls. This is not limited to marker traces—it controls all profiling activity.

**Example with selected regions:**

Instrument your code to mark regions of interest:

.. code-block:: c++

    #include <rocprofiler-sdk-roctx/roctx.h>

    // Initialization code (not profiled when using --selected-regions)
    hipMalloc(&gpuMatrix, NUM * sizeof(float));
    hipMalloc(&gpuTransposeMatrix, NUM * sizeof(float));

    // Start profiling for region 1
    roctxProfilerResume(0);

    // Region 1: Data transfer and computation (will be profiled)
    hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
    hipLaunchKernelGGL(matrixTranspose,
                       dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
                       dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                       0, 0,
                       gpuTransposeMatrix, gpuMatrix, WIDTH);

    // Stop profiling
    roctxProfilerPause(0);

    // Code here is not profiled
    someOtherFunction();

    // Start profiling for region 2
    roctxProfilerResume(0);

    // Region 2: Result retrieval (will be profiled)
    hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);

    // Stop profiling
    roctxProfilerPause(0);

    // Cleanup (not profiled)
    hipFree(gpuMatrix);
    hipFree(gpuTransposeMatrix);

Run with selected regions enabled:

.. code-block:: shell

    rocprofv3 --selected-regions --hip-trace --kernel-trace --output-format csv -- <application_path>

This command will collect HIP API traces and kernel traces **only** within the regions enclosed by ``roctxProfilerResume(0)`` and ``roctxProfilerPause(0)``. The initialization code, ``someOtherFunction()``, and cleanup code will not be profiled.

**Multiple region profiling:**

You can instrument your code with multiple ``roctxProfilerResume()``/``roctxProfilerPause()`` pairs throughout the application. The ``--selected-regions`` option will respect all of them, collecting data in each region where profiling is resumed. There is no need to provide a list of regions—all regions enclosed by the API calls will be profiled automatically.
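
When an application has many regions, one way to keep each resume/pause pair balanced is a small scope guard. The sketch below is only an illustration and is not part of ROCTx: ``ScopedProfiling`` is a hypothetical helper, and the resume and pause actions are passed in as callables so that the block compiles without ROCm.

```cpp
#include <functional>
#include <utility>

// Hypothetical scope guard: runs `resume` on construction and `pause` on
// destruction, so every region that is opened is also closed, even on an
// early return or an exception.
class ScopedProfiling
{
public:
    ScopedProfiling(std::function<void()> resume, std::function<void()> pause)
    : pause_(std::move(pause))
    {
        if(resume) resume();
    }

    ~ScopedProfiling()
    {
        if(pause_) pause_();
    }

    // Copying would run the pause action twice, so it is disabled.
    ScopedProfiling(const ScopedProfiling&) = delete;
    ScopedProfiling& operator=(const ScopedProfiling&) = delete;

private:
    std::function<void()> pause_;
};
```

In an instrumented application, the callables would typically be lambdas such as ``[] { roctxProfilerResume(0); }`` and ``[] { roctxProfilerPause(0); }``, with one guard object declared at the top of each block to be profiled.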

**Nested pause/resume pairs:**

By default, each ``roctxProfilerResume()`` and ``roctxProfilerPause()`` call directly toggles profiling on or off. If you have nested pairs (for example, a function with its own profiling control called from within another profiled region), the innermost call changes the profiling state, so an inner ``roctxProfilerPause()`` stops collection for the remainder of the outer region. To handle nested pairs, use the ``--selected-regions-ref-count`` option, which reference-counts pause/resume calls and toggles profiling only when the boundaries of the outermost pair are crossed.
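
The difference between direct toggling and reference counting can be sketched with a toy model. This is an assumption-laden illustration of the behavior described above, not the ``rocprofv3`` implementation: ``NestedModel`` and ``collecting_after_inner_pair`` are hypothetical names, and the model assumes collection starts disabled, as it does with ``--selected-regions``.

```cpp
// Toy model of nested resume/pause handling. It does not call ROCTx.
// `ref_counted` stands in for passing --selected-regions-ref-count.
class NestedModel
{
public:
    explicit NestedModel(bool ref_counted)
    : ref_counted_(ref_counted)
    {}

    void resume()  // models roctxProfilerResume(0)
    {
        if(ref_counted_)
        {
            ++depth_;
            collecting_ = true;
        }
        else
        {
            collecting_ = true;  // direct toggle
        }
    }

    void pause()  // models roctxProfilerPause(0)
    {
        if(ref_counted_)
        {
            if(depth_ > 0) --depth_;
            collecting_ = depth_ > 0;  // stop only when the outermost pair closes
        }
        else
        {
            collecting_ = false;  // direct toggle
        }
    }

    bool collecting() const { return collecting_; }

private:
    bool ref_counted_;
    int  depth_      = 0;
    bool collecting_ = false;
};

// Reports whether collection is still active inside an outer region
// after an inner resume/pause pair has closed.
bool collecting_after_inner_pair(bool ref_counted)
{
    NestedModel model(ref_counted);
    model.resume();  // outer region opens
    model.resume();  // inner region opens
    model.pause();   // inner region closes
    return model.collecting();
}
```

In this model, direct toggling leaves collection off after the inner pair closes, while reference counting keeps it on until the outer ``pause()`` is reached.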

**Thread-specific control:**

For more fine-grained control, you can use thread-specific pause/resume:

.. code-block:: c++

    roctx_thread_id_t tid;
    roctxGetThreadId(&tid);

    roctxProfilerResume(tid);  // Resume profiling on current thread only
    // ... code to profile ...
    roctxProfilerPause(tid);   // Pause profiling on current thread only

When using ``0`` as the thread ID argument, the control applies to all threads. When using a specific thread ID (obtained via ``roctxGetThreadId()``), the control applies only to that thread.

**Combining with other profiling options:**

The ``--selected-regions`` option works with all tracing and profiling options:

- API tracing: ``--hip-trace``, ``--hsa-trace``, ``--marker-trace``, ``--rccl-trace``, etc.
- Kernel tracing: ``--kernel-trace``
- Memory tracing: ``--memory-copy-trace``, ``--memory-allocation-trace``, ``--scratch-memory-trace``
- Counter collection: ``--pmc``
- Thread tracing: ``--advanced-thread-trace``
- PC sampling: ``--pc-sampling-beta-enabled``

.. note::
    The ``--selected-regions`` option cannot be used together with ``--collection-period``. These are mutually exclusive profiling control mechanisms. Use ``--selected-regions`` for code-based control and ``--collection-period`` for time-based control.

**Comparison summary:**

.. list-table:: roctxProfilerPause/Resume behavior comparison
    :header-rows: 1
    :widths: 30 35 35

    * - Aspect
      - Without --selected-regions
      - With --selected-regions

    * - Initial profiler state
      - Enabled (profiling active)
      - Disabled (profiling inactive)

    * - roctxProfilerPause() effect
      - Temporarily hides code from profiling
      - Stops profiling in a region

    * - roctxProfilerResume() effect
      - Resumes profiling after pause
      - Starts profiling in a region

    * - Use case
      - Profile everything except marked regions
      - Profile only marked regions

    * - Typical workflow
      - Exclude uninteresting regions
      - Include only interesting regions

    * - Data collected
      - All code except paused regions
      - Only resumed regions

Resource naming
++++++++++++++++

@@ -538,11 +538,15 @@ The trace output is captured in a rocpd database file and can be converted to pf

The preceding command generates a rocpd database file prefixed with the process ID which can be converted to pftrace to be visualized in Perfetto UI.


.. code-block:: shell

    $ /opt/rocm/bin/rocpd2pftrace -i 163852_results.db

Here is the RCCL trace visualized in Perfetto UI:
The following image shows the ``RCCL`` trace for the referenced `allreduce_rccl sample application <https://github.com/bgopesh/allreduce_rccl/blob/master/nccl_allreduce.cpp>`_ in the Perfetto UI.
The host thread track and selected compute streams are pinned in the visualization to improve readability.
This makes it easy to see the ``RCCL`` compute kernels launched during ``ncclAllReduce`` operations on the host thread.


.. image:: /data/perfetto_rccl.png
