Skip to content

Commit a5d4cb4

Browse files
authored
[SYCL][DOC] Add design document for SYCL-RTC (#19698)
Signed-off-by: Julian Oppermann <[email protected]>
1 parent 9924275 commit a5d4cb4

File tree

2 files changed

+318
-0
lines changed

2 files changed

+318
-0
lines changed

sycl/doc/design/SYCL-RTC.md

Lines changed: 317 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,317 @@
1+
# SYCL Runtime Compilation
2+
3+
SYCL-RTC means using the
4+
[`kernel_compiler`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc)
5+
extension to wrap a SYCL source string comprised of kernel definitions in the
6+
[free-function syntax](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_free_function_kernels.asciidoc)
7+
into a `kernel_bundle` in the `ext_oneapi_source` state, which is then compiled
8+
into `exectuable` state by the extension's `build(...)` function. The feature is
9+
backed by an implementation inside the `sycl-jit` library, which exposes the
10+
modular, LLVM-based compiler tech behind DPC++ to be called by the SYCL runtime.
11+
This document gives an overview of the design.
12+
13+
```c++
14+
#include <sycl/sycl.hpp>
15+
namespace syclexp = sycl::ext::oneapi::experimental;
16+
17+
// ...
18+
19+
std::string sycl_source = R"""(
20+
  #include <sycl/sycl.hpp>
21+
 
22+
  extern "C" SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((
23+
  sycl::ext::oneapi::experimental::nd_range_kernel<1>))
24+
  void vec_add(float* in1, float* in2, float* out){
25+
    size_t id = sycl::ext::oneapi::this_work_item::get_nd_item<1>()
26+
    .get_global_linear_id();
27+
    out[id] = in1[id] + in2[id];
28+
  }
29+
)""";
30+
31+
sycl::queue q;
32+
33+
auto source_bundle = syclexp::create_kernel_bundle_from_source(
34+
q.get_context(), syclexp::source_language::sycl, sycl_source);
35+
36+
// This design document explains what happens on the next line.
37+
auto exec_bundle = syclexp::build(source_bundle);
38+
```
39+
40+
## File-and-process-based prototype
41+
42+
The
43+
[first implementation](https://github.com/intel/llvm/blob/03cb2b25026f060149eb94c85b228e5b3a780588/sycl/source/detail/kernel_compiler/kernel_compiler_sycl.cpp#L254)
44+
of the `build(...)` function wrote the source string into a temporary file,
45+
invoked DPC++ on it with the `-fsycl-dump-device-code` flag to dump the device
46+
code to another file in SPIR-V format, and finally loaded that file back into
47+
the runtime, from where it was executed.
48+
49+
## The rationale for an in-memory compilation pipeline
50+
51+
Invoking the DPC++ executable as outlined in the previous section worked
52+
reasonably well to implement the basic `kernel_compiler` extension, but we
53+
observed several shortcomings:
54+
55+
- Functional completeness: Emitting a single SPIR-V file is sufficient for
56+
simple kernels, but more advanced device code may result in multiple *device
57+
images* comprised of SPIR-V binaries and accompanying metadata (*runtime
58+
properties*) that needs to be communicated to the runtime.
59+
- Robustness: Reading multiple dependent files from a temporary directory can be
60+
be fragile.
61+
- Performance: Multiple processes are launched by the compiler driver, and file
62+
I/O operations have a non-negligible overhead. The `-fsycl-dump-device-code`
63+
required the presence of a dummy `main()` to be added to the source string,
64+
and caused an unnecessary host compilation to be performed.
65+
- Security: Reading executable code from disk is a security concern, and users
66+
of an RTC-enabled application may be unaware that a compilation writing
67+
intermediate files is happening in the background.
68+
69+
These challenges ultimately motivated the design of the **in-memory compilation
70+
pipeline** based on the `sycl-jit` library which is now the default approach in
71+
DPC++ and the oneAPI product distribution since the 2025.2 release. This new
72+
approach leverages **modular compiler technology** to produce a faster, more
73+
feature-rich, more robust and safer implementation of the `kernel_compiler`
74+
extension.
75+
76+
The individual steps in the pipeline (frontend, device library linking,
77+
`sycl-post-link` and target format translation) are now invoked programmatically
78+
via an API inside the same process, and intermediate results are passed along as
79+
objects in memory. The code can be found in the
80+
[`compileSYCL(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/RTC.cpp#L57)
81+
function.
82+
83+
## Using the LibTooling API to compile the source string to an `llvm::Module`
84+
85+
[LibTooling](https://clang.llvm.org/docs/LibTooling.html) is a high-level API to
86+
write standalone tools based on Clang, such as linters, refactoring tools or
87+
static analysers. To use it, one defines a *tool action* to run on a set of
88+
files in a *virtual filesystem overlay*, which the frontend then processes
89+
according to a *compilation command database*.
90+
91+
For SYCL-RTC, the filesystem overlay is populated with files containing the
92+
source string and any virtual `include_files` (defined via the homonymous
93+
property). The compilation command is static and puts the frontend into
94+
`-fsycl-device-only` mode. Any user-given options (from the `build_options`
95+
property) are appended. Lastly, the implementation defines a custom tool action
96+
which runs the frontend until LLVM codegen, and then obtains ownership of the
97+
LLVM module.
98+
99+
This might be a slightly unusual way to use of LibTooling, but we found it works
100+
great for SYCL-RTC. The next sections explain the
101+
[`jit_compiler::compileDeviceCode(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L418)
102+
function in more detail.
103+
104+
### Step 1: Determine the path of the compiler installation
105+
106+
To set up up working frontend invocation, we need to know where to find
107+
supplemental files such as the SYCL headers. Normally, these paths are
108+
determined relative to the compiler executable, however in our case, the
109+
executable is actually the RTC-enabled application, which can reside in an
110+
arbitrary location. Instead, we use OS-specific logic inside
111+
[`getDPCPPRoot()`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L112)
112+
to determine the location of the shared library `sycl-jit.so` (or `.dll` on
113+
Windows) which contains the SYCL-RTC implementation. From its location, we can
114+
derive the compiler installation's root directory.
115+
116+
### Step 2: Collect command-line arguments
117+
118+
The next step is to collect the command-line arguments for the frontend
119+
invocation. The
120+
[`adjustArgs(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L320)
121+
function relies on Clang's option handling infrastructure to set the required
122+
options to enter the device compilation mode (`-fsycl-device-only`), set up the
123+
compiler environment, and select the target. Finally, any user-specified
124+
arguments passed via the
125+
[`build_options`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc#new-properties-for-the-build-and-compile-functions)
126+
property are appended to the list of command-line arguments.
127+
128+
### Step 3: Configure the `ClangTool`
129+
130+
Once we know the required command-line arguments, we can set up the compilation
131+
command database and an
132+
[instance](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L433)
133+
of the
134+
[`ClangTool`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Tooling/Tooling.h#L317)
135+
class, which provides the entry point to the LibTooling interface. As we'll be
136+
translating only a single file containing the source string, we construct a
137+
[`FixedCompilationDatabase`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Tooling/CompilationDatabase.h#L154)
138+
relative to the current working directory.
139+
140+
To implement the `kernel_compiler` extension cleanly, we need to capture all
141+
output (e.g. warnings and errors) from the frontend. The
142+
[`ClangDiagnosticsWrapper`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L274)
143+
class configures a
144+
[`TextDiagnosticsPrinter`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Frontend/TextDiagnosticPrinter.h#L27)
145+
to append all messages to a string maintained by our implementation to collect
146+
all output produced during the runtime compilation.
147+
148+
The configuration of the `ClangTool` instance continues in the
149+
[`setupTool`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L353)
150+
function. First, we redirect all output to our diagnostics wrapper. Then, we
151+
[set up the overlay
152+
filesystem](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L361-L364)
153+
with a file named `rtc_<n>.cpp` (*n* is incremented for each use of the
154+
`kernel_compiler` extension's `build(...)` function) in the current directory
155+
with the contents of the source string. Each of the virtual header files that
156+
the application defined via the
157+
[`include_files`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc#new-properties-for-the-create_kernel_bundle_from_source-function)
158+
property becomes also a file in the overlay filesystem, using the path specified
159+
in the property.
160+
161+
The `ClangTool` class exposes so-called argument adjusters, which are intended
162+
to modify the command-line arguments coming from the compilation command
163+
database. We have to
164+
[clear](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L368)
165+
the default adjusters defined by the class, because one of them injects the
166+
`-fsyntax-only` flag, which would conflict with the `-fsycl-device-only` flag we
167+
need for SYCL-RTC. Finally, we
168+
[add](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L371)
169+
an argument adjuster ourselves to overwrite the name of executable in the
170+
invocation. Again, this is to help the correct detection of the environment, by
171+
making the invocation as similar as possible to a normal use of DPC++.
172+
173+
### Step 4: Run an action
174+
175+
The last step is to define a
176+
[`ToolAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Tooling/Tooling.h#L80)
177+
to be executed on the source files. Clang conveniently provides the
178+
[`EmitLLVMAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/CodeGen/CodeGenAction.h#L103),
179+
which runs the frontend up until the LLVM IR code generation, which is exactly
180+
what we need. However, LibTooling does not provides a helper to wrap it in a
181+
`ToolAction`, so we need to define and run our own
182+
[`GetLLVMModuleAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L241).
183+
184+
We extracted common boilerplate code to configure a
185+
[`CompilerInstance`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Frontend/CompilerInstance.h#L81)
186+
in the
187+
[`RTCActionBase`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L176)
188+
class. Inside the `GetLLVMModuleAction`, we instantiate and execute the
189+
aforementioned `EmitLLVMAction`, and, in case the translation was successful,
190+
[obtains ownership](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L255)
191+
of the constructed `llvm::Module` from it.
192+
193+
Finally, the call to
194+
[`Action.takeModule()`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L442)
195+
transfers ownership again to the caller of `compileDeviceCode`. Note that this
196+
simple mechanism works because we know that there is only a single compilation
197+
happening for every instance of the `ClangTool` and hence our
198+
`GetLLVMModuleAction` class.
199+
200+
## Caching
201+
202+
The implementation optionally uses the runtime's
203+
[persistent cache](https://intel.github.io/llvm/design/KernelProgramCache.html#persistent-cache)
204+
to elide recurring invocations of the frontend, which we observed to be the most
205+
expensive (in terms of runtime overhead) phase of our compilation pipeline.
206+
207+
### Overall design
208+
209+
We cache only the frontend invocation, meaning that after a successful
210+
translation, we store the LLVM IR module obtained via LibTooling on disk in the
211+
Bitcode format using built-in utilities. In case of a cache hit in a later
212+
runtime compilation, we load the module from disk and feed it into the device
213+
linking phase. The rationale for this design was that were no utilities to save
214+
and restore the linked and post-processed device images to disk at the time (the
215+
[SYCLBIN](https://intel.github.io/llvm/design/SYCLBINDesign.html) infrastructure
216+
was added later), and caching these steps would have resulted only in marginal
217+
further runtime savings.
218+
219+
### Cache key considerations
220+
221+
The main challenge is to define a robust cache key. Because code compiled via
222+
SYCL-RTC can `#include` header files defined via the `include_files` property as
223+
well as from the filesystem, e.g. `sycl.hpp` from the DPC++ installation or user
224+
libraries, it is not sufficient to look only at the source string. In order to
225+
make the cache as conservative as possible (cache collisions are unlikely but
226+
mathematically possible), we decided to compute a hash value of the
227+
*preprocessed* source string, i.e. with all `#include` directives resolved. We
228+
additionally compute a hash value of the rendered command-line arguments, and
229+
append it to the hash of the preprocessed source to obtain the final cache key.
230+
231+
### Implementation notes
232+
233+
The cache key computation is implemented in the
234+
[`jit_compiler::calculateHash(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L381)
235+
function. We are again relying on LibTooling to invoke the preprocessor -
236+
handily, Clang provides a
237+
[`PreprocessorFrontendAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Frontend/FrontendAction.h#L294)
238+
that we extend to tailor to our use-case. We choose
239+
[BLAKE3](https://en.wikipedia.org/wiki/BLAKE_(hash_function)) as the hash
240+
algorithm because its proven in similar contexts (most notably,
241+
[ccache](https://ccache.dev)) and available as a utility in the LLVM ecosystem.
242+
As the output is a byte array, we apply Base64 encoding to obtain a character
243+
string for use with the persistent cache.
244+
245+
## Device library linking and SYCL-specific transformations
246+
247+
With an LLVM IR module in hand, obtained either from the frontend or the cache,
248+
the next steps in the compilation pipeline are simple.
249+
250+
The device library linking is done by the
251+
[`jit_compiler::linkDeviceLibraries(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L566)
252+
function. These libraries provide primitives for a variety of extra
253+
functionality, such as an extended set of math functions and support for
254+
`bfloat16` arithmetic, and are available as Bitcode files inside the DPC++
255+
installation or the vendor toolchain, so we just use LLVM utilities to load them
256+
into memory and link them to the module representing the runtime-compiled
257+
kernels. The main challenge here is that the logic to select the device
258+
libraries is currently not reusable from its implementation in the driver, so
259+
our implementation is a simplified copy of the
260+
[`SYCL::getDeviceLibraries(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/lib/Driver/ToolChains/SYCL.cpp#L553)
261+
method, which needs to be kept in sync with the driver code.
262+
263+
For the SYCL-specific post-processing, implemented in
264+
[`jit_compiler::performPostLink(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L750),
265+
we can reuse modular analysis and transformation passes in the
266+
[`SYCLLowerIR`](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLLowerIR)
267+
component. The main tasks for the post-processing passes is to split the device
268+
code module into smaller units (either as requested by the user, or required by
269+
the ESIMD mode), and to compute the properties that need to be passed to the
270+
SYCL runtime when the device images are loaded. The logic to orchestrate the
271+
`SYCLLowerIR` passes is adapted from the `sycl-post-link` tool's
272+
[`processInputModule(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/llvm/tools/sycl-post-link/sycl-post-link.cpp#L606)
273+
function. This duplicated code should be removed as well once a suitable
274+
reusable implementation becomes available.
275+
276+
277+
## Translation to the target format
278+
279+
The
280+
[final phase](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/RTC.cpp#L139)
281+
in the pipeline is to translate the LLVM IR modules resulting from the previous
282+
phase into a device-specific target format that can be handled by the runtime.
283+
For Intel CPUs and GPUs, that's binary SPIR-V. For AMD and NVIDIA GPUs, we emit
284+
AMDGCN and PTX assembly, respectively. Over time, we created our own set of
285+
[utilities](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/translation/Translation.h)
286+
to facilitate the translation. Internally, we dispatch the task to either the
287+
SPIR-V translator (a copy of which is maintained inside the DPC++ repository),
288+
or use vendor-specific backends that are part of LLVM to generate the
289+
third-party GPU code.
290+
291+
## Third-party hardware support
292+
293+
SYCL-RTC works for AMD and NVIDIA GPUs, too. The usage of the `kernel_compiler`
294+
extension remains the same for SYCL devices representing such a third-party GPU.
295+
The concrete GPU architecture is queried via the environment variable
296+
`SYCL_JIT_AMDGCN_PTX_TARGET_CPU` when executing the RTC-enabled application. For
297+
AMD GPUs, it is **mandatory** to set it. For NVIDIA GPUs, it is highly
298+
recommended to change it from the conservative default architecture (`sm_50`).
299+
300+
```shell
301+
$ clang++ -fsycl myapp.cpp -o myapp
302+
$ SYCL_JIT_AMDGCN_PTX_TARGET_CPU=sm_90 ./myapp
303+
```
304+
305+
A list of values that can be set as the target CPU can be found in the
306+
[documentation of the `-fsycl-targets=`
307+
option](https://intel.github.io/llvm/UsersManual.html#generic-options) (leave
308+
out the `amd_gpu_` and `nvidia_gpu_` prefixes).
309+
310+
## Further reading
311+
312+
- Technical presentation at IWOCL 2025: *Fast In-Memory Runtime Compilation of
313+
SYCL Code*:
314+
[Slides](https://www.iwocl.org/wp-content/uploads/iwocl-2025-julian-oppermann-fast-in-memory-runtime.pdf)
315+
[Video Recording](https://youtu.be/X9mS8xetZJY)
316+
- Blog post:
317+
[*SYCL Runtime Compilation: A New Way to Specialise Kernels Using C++ Metaprogramming*](https://codeplay.com/portal/blogs/2025/07/08/sycl-runtime-compilation)

sycl/doc/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,6 +56,7 @@ Design Documents for the oneAPI DPC++ Compiler
5656
design/PrivateAlloca
5757
design/SYCLBINDesign
5858
design/PropertySets
59+
design/SYCL-RTC
5960
New OpenCL Extensions <https://github.com/intel/llvm/tree/sycl/sycl/doc/design/opencl-extensions>
6061
New SPIR-V Extensions <https://github.com/intel/llvm/tree/sycl/sycl/doc/design/spirv-extensions>
6162
design/Fuzzing

0 commit comments

Comments
 (0)