diff --git a/projects/rocprofiler-sdk/source/bin/rocprofv3.py b/projects/rocprofiler-sdk/source/bin/rocprofv3.py
index a0bddbedacc..39dc25352ec 100755
--- a/projects/rocprofiler-sdk/source/bin/rocprofv3.py
+++ b/projects/rocprofiler-sdk/source/bin/rocprofv3.py
@@ -750,7 +750,7 @@ def add_parser_bool_argument(gparser, *args, **kwargs):
     add_parser_bool_argument(
         filter_options,
         "--selected-regions",
-        help="If set, rocprofv3 will only profile regions of code surrounded by roctxProfilerResume(0) and roctxProfilerPause(0)",
+        help="If set, rocprofv3 will only profile regions of code surrounded by roctxProfilerResume(0) and roctxProfilerPause(0).",
     )
     add_parser_bool_argument(
         filter_options,
diff --git a/projects/rocprofiler-sdk/source/docs/data/perfetto_marker.png b/projects/rocprofiler-sdk/source/docs/data/perfetto_marker.png
new file mode 100644
index 00000000000..ee50f45351f
Binary files /dev/null and b/projects/rocprofiler-sdk/source/docs/data/perfetto_marker.png differ
diff --git a/projects/rocprofiler-sdk/source/docs/data/perfetto_rccl.png b/projects/rocprofiler-sdk/source/docs/data/perfetto_rccl.png
index 83c0b8dc2f7..292799f5c99 100644
Binary files a/projects/rocprofiler-sdk/source/docs/data/perfetto_rccl.png and b/projects/rocprofiler-sdk/source/docs/data/perfetto_rccl.png differ
diff --git a/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofiler-sdk-roctx.rst b/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofiler-sdk-roctx.rst
index 9bdf34c2932..69cb76411e3 100644
--- a/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofiler-sdk-roctx.rst
+++ b/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofiler-sdk-roctx.rst
@@ -27,6 +27,16 @@ Markers
 
 Markers are used to insert a marker in the code with a message. Creating markers helps you see when a line of code is executed.
 
+When using ROCTx markers with profiling tools that support Perfetto trace visualization (such as ``rocprofv3`` or ``rocprof-sys``), markers created with ``roctxMark()`` appear as arrows in the Perfetto UI timeline. For example, markers like ``roctxMark("pre-launch")`` and ``roctxMark("post-kernel-launch")`` will be displayed as visual markers that point to the exact timestamp when that line of code was executed.
+
+To illustrate this, a call to ``roctxMark("iter")`` was added after the kernel launch in the timing loop of the `daxpy training example code `_. In the Perfetto trace generated by running ``rocprofv3 --runtime-trace`` and ``rocpd2pftrace``, we see visual markers at the end of each kernel launch in the timing loop ROCTx region.
+
+.. figure:: /data/perfetto_marker.png
+   :alt: ROCTx markers displayed as arrows in Perfetto UI
+   :align: center
+
+   Example of ``roctxMark()`` annotations appearing as arrows in the Perfetto UI timeline
+
 Ranges
 =======
 
@@ -161,6 +171,162 @@ The preceding command generates a ``hip_api_trace.csv`` file prefixed with the p
     "HIP_RUNTIME_API","hipFree",1643920,1643920,15,320301643320908,320301643511479
     "HIP_RUNTIME_API","hipFree",1643920,1643920,16,320301643512629,320301643585639
 
+Profiler control with selected regions
++++++++++++++++++++++++++++++++++++++++
+
+The ``roctxProfilerPause()`` and ``roctxProfilerResume()`` APIs can be used in two different ways depending on whether the ``--selected-regions`` option is used with ``rocprofv3``.
+
+Two modes of operation
+=======================
+
+**Mode 1: Default behavior (without --selected-regions)**
+
+When running ``rocprofv3`` without the ``--selected-regions`` option, profiling starts **enabled** from the beginning of the application. The ``roctxProfilerPause()`` and ``roctxProfilerResume()`` APIs are used to temporarily hide specific sections of code from profiling.
+
+- Use case: "Profile everything except these specific regions"
+- Profiler starts: **Enabled**
+- ``roctxProfilerPause()``: Temporarily stops data collection
+- ``roctxProfilerResume()``: Resumes data collection
+- Example shown in the previous section demonstrates this mode
+
+**Mode 2: Selected regions profiling (with --selected-regions)**
+
+When running ``rocprofv3`` with the ``--selected-regions`` option, profiling starts **disabled** by default. Data collection only occurs within regions explicitly enclosed by ``roctxProfilerResume()`` and ``roctxProfilerPause()`` calls.
+
+- Use case: "Profile only these specific regions, ignore everything else"
+- Profiler starts: **Disabled**
+- ``roctxProfilerResume()``: Starts data collection
+- ``roctxProfilerPause()``: Stops data collection
+- All tracing and profiling options collect data **only** within the marked regions
+
+Using --selected-regions option
+================================
+
+The ``--selected-regions`` option enables profiling only within specific code regions that you explicitly mark. This provides fine-grained control over data collection, allowing you to focus profiling on specific regions of interest in your application.
+
+**Important:** When ``--selected-regions`` is enabled, **all** requested tracing or profiling data (kernel traces, API traces, memory copy traces, counter collection, etc.) will be collected **only** within the regions enclosed by ``roctxProfilerResume()`` and ``roctxProfilerPause()`` calls. This is not limited to marker traces—it controls all profiling activity.
+
+**Example with selected regions:**
+
+Instrument your code to mark regions of interest:
+
+.. code-block:: c++
+
+    #include
+
+    // Initialization code (not profiled when using --selected-regions)
+    hipMalloc(&gpuMatrix, NUM * sizeof(float));
+    hipMalloc(&gpuTransposeMatrix, NUM * sizeof(float));
+
+    // Start profiling for region 1
+    roctxProfilerResume(0);
+
+    // Region 1: Data transfer and computation (will be profiled)
+    hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
+    hipLaunchKernelGGL(matrixTranspose,
+                       dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
+                       dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
+                       0, 0,
+                       gpuTransposeMatrix, gpuMatrix, WIDTH);
+
+    // Stop profiling
+    roctxProfilerPause(0);
+
+    // Code here is not profiled
+    someOtherFunction();
+
+    // Start profiling for region 2
+    roctxProfilerResume(0);
+
+    // Region 2: Result retrieval (will be profiled)
+    hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
+
+    // Stop profiling
+    roctxProfilerPause(0);
+
+    // Cleanup (not profiled)
+    hipFree(gpuMatrix);
+    hipFree(gpuTransposeMatrix);
+
+Run with selected regions enabled:
+
+.. code-block:: shell
+
+    rocprofv3 --selected-regions --hip-trace --kernel-trace --output-format csv --
+
+This command will collect HIP API traces and kernel traces **only** within the regions enclosed by ``roctxProfilerResume(0)`` and ``roctxProfilerPause(0)``. The initialization code, ``someOtherFunction()``, and cleanup code will not be profiled.
+
+**Multiple region profiling:**
+
+You can instrument your code with multiple ``roctxProfilerResume()``/``roctxProfilerPause()`` pairs throughout the application. The ``--selected-regions`` option will respect all of them, collecting data in each region where profiling is resumed. There is no need to provide a list of regions—all regions enclosed by the API calls will be profiled automatically.
+
+**Nested pause/resume pairs:**
+
+By default, each ``roctxProfilerResume()`` and ``roctxProfilerPause()`` call directly toggles profiling on or off. If you have nested pairs (e.g., a function with profiling control called from within another profiled region), the innermost call will affect the profiling state. To handle nested pairs with reference counting, use the ``--selected-regions-ref-count`` option, which uses reference counting for pause/resume calls and only toggles profiling when the outermost pair boundaries are crossed.
+
+**Thread-specific control:**
+
+For more fine-grained control, you can use thread-specific pause/resume:
+
+.. code-block:: c++
+
+    roctx_thread_id_t tid;
+    roctxGetThreadId(&tid);
+
+    roctxProfilerResume(tid);  // Resume profiling on current thread only
+    // ... code to profile ...
+    roctxProfilerPause(tid);   // Pause profiling on current thread only
+
+When using ``0`` as the thread ID argument, the control applies to all threads. When using a specific thread ID (obtained via ``roctxGetThreadId()``), the control applies only to that thread.
+
+**Combining with other profiling options:**
+
+The ``--selected-regions`` option works with all tracing and profiling options:
+
+- API tracing: ``--hip-trace``, ``--hsa-trace``, ``--marker-trace``, ``--rccl-trace``, etc.
+- Kernel tracing: ``--kernel-trace``
+- Memory tracing: ``--memory-copy-trace``, ``--memory-allocation-trace``, ``--scratch-memory-trace``
+- Counter collection: ``--pmc``
+- Thread tracing: ``--advanced-thread-trace``
+- PC sampling: ``--pc-sampling-beta-enabled``
+
+.. note::
+    The ``--selected-regions`` option cannot be used together with ``--collection-period``. These are mutually exclusive profiling control mechanisms. Use ``--selected-regions`` for code-based control and ``--collection-period`` for time-based control.
+
+**Comparison summary:**
+
+.. list-table:: roctxProfilerPause/Resume behavior comparison
+   :header-rows: 1
+   :widths: 30 35 35
+
+   * - Aspect
+     - Without --selected-regions
+     - With --selected-regions
+
+   * - Initial profiler state
+     - Enabled (profiling active)
+     - Disabled (profiling inactive)
+
+   * - roctxProfilerPause() effect
+     - Temporarily hides code from profiling
+     - Stops profiling in a region
+
+   * - roctxProfilerResume() effect
+     - Resumes profiling after pause
+     - Starts profiling in a region
+
+   * - Use case
+     - Profile everything except marked regions
+     - Profile only marked regions
+
+   * - Typical workflow
+     - Exclude uninteresting regions
+     - Include only interesting regions
+
+   * - Data collected
+     - All code except paused regions
+     - Only resumed regions
+
 Resource naming
 ++++++++++++++++
 
diff --git a/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst b/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst
index 52014946b40..55b8e293590 100644
--- a/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst
+++ b/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst
@@ -538,11 +538,15 @@ The trace output is captured in a rocpd database file and can be converted to pf
 
 The preceding command generates a rocpd database file prefixed with the process ID which can be converted to pftrace to be visualized in Perfetto UI.
 
+
 .. code-block:: shell
 
     $ /opt/rocm/bin/rocpd2pftrace -i 163852_results.db
 
-Here is the RCCL trace visualized in Perfetto UI:
+The following image visualizes the ``RCCL`` trace for the referenced `allreduce_rccl sample application `_ using the Perfetto UI.
+The host thread track and select compute streams have been pinned in the visualization to enhance readability.
+This enables clear observation of the ``RCCL`` compute kernels launched during ``ncclAllReduce`` operations on the host thread.
+
 
 .. image:: /data/perfetto_rccl.png
 