
8363620: AArch64: reimplement emit_static_call_stub() #26638


Open · fg1417 wants to merge 1 commit into master

Conversation


@fg1417 fg1417 commented Aug 5, 2025

In the existing implementation, the static call stub typically emits a sequence like:
`isb; movk; movz; movz; movk; movz; movz; br`.

This patch reimplements it using a more compact and patch-friendly sequence:

```
ldr x12, Label_data
ldr x8, Label_entry
br x8
Label_data:
  0x00000000
  0x00000000
Label_entry:
  0x00000000
  0x00000000
```

The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an isb[1], which are both relatively expensive operations.
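
For illustration, a minimal sketch of how such a stub could be emitted with the AArch64 MacroAssembler; the register choices mirror the layout above (x12 is `rmethod`, x8 is `rscratch1`), but the function name and placeholder values are illustrative, not the exact patch.

```
// Illustrative sketch only (assumes MacroAssembler member context); the real
// change may differ in details such as relocation handling and size checks.
void MacroAssembler::emit_static_call_stub_sketch() {
  Label data, entry;
  ldr(rmethod, data);       // x12 <- Method*, loaded from the data word below
  ldr(rscratch1, entry);    // x8  <- call target
  br(rscratch1);            // jump to the resolved target
  bind(data);
  emit_int64(0);            // Method* slot, patched at resolution time
  bind(entry);
  emit_int64(0);            // destination slot, patched at resolution time
}
```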

While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an isb. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
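
On the patching side, retargeting then amounts to two 64-bit data stores at fixed offsets from the stub start, with no instruction bytes touched. A rough sketch with a hypothetical helper (offsets follow the layout above; the ordering and atomicity of these stores are discussed in the review thread below):

```
// Hypothetical helper, not the actual HotSpot accessor: update the two data
// words behind the ldr/ldr/br sequence; no instructions change, so no
// I-cache flush or isb is needed for the data itself.
static void patch_static_stub(address stub, Method* callee, address dest) {
  const int data_offset  = 3 * NativeInstruction::instruction_size; // after ldr; ldr; br
  const int entry_offset = data_offset + wordSize;
  *(uint64_t*)(stub + data_offset)  = (uint64_t)callee;   // Method* word
  *(uint64_t*)(stub + entry_offset) = (uint64_t)dest;     // branch-target word
}
```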

A microbenchmark (StaticCallStub.java) shows the time per operation dropping by approximately 43%.

Benchmark                       (length)   Mode   Cnt Master     Patch      Units
StaticCallStubFar.callCompiled    1000     avgt   5   39.346     22.474     us/op
StaticCallStubFar.callCompiled    10000    avgt   5   390.05     218.478    us/op
StaticCallStubFar.callCompiled    100000   avgt   5   3869.264   2174.001   us/op
StaticCallStubNear.callCompiled   1000     avgt   5   39.093     22.582     us/op
StaticCallStubNear.callCompiled   10000    avgt   5   387.319    217.398    us/op
StaticCallStubNear.callCompiled   100000   avgt   5   3855.825   2206.923   us/op

All tests in Tier1 to Tier3, under both release and debug builds, have passed.

[1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8363620: AArch64: reimplement emit_static_call_stub() (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26638/head:pull/26638
$ git checkout pull/26638

Update a local copy of the PR:
$ git checkout pull/26638
$ git pull https://git.openjdk.org/jdk.git pull/26638/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26638

View PR using the GUI difftool:
$ git pr show -t 26638

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26638.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Aug 5, 2025

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 5, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

The @openjdk bot added the `rfr` (Pull request is ready for review) label on Aug 5, 2025.
@openjdk

openjdk bot commented Aug 5, 2025

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Aug 5, 2025

Webrevs

Review thread on the following lines of the patch:

```
ldr(rmethod, far_jump_metadata);
ldr(rscratch1, far_jump_entry);
br(rscratch1);
bind(far_jump_metadata);
```
fg1417 (Author)
I’m considering whether the data here should be 8-byte aligned, similar to what we did for the trampoline stubs.

I tried a small test case:

```
    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    public static void callInterpreted0(int i) {
        val0 = i;
    }

    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    public static void callInterpreted1(int i) {
        val1 = i;
    }

    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    public static void callInterpreted2(int i) {
        val2 = i;
    }

    @Benchmark
    public void callCompiled() {
        for (int i = 0; i < length; i++) {
          callInterpreted0(i); // Make sure this is excluded from compilation
          callInterpreted1(i);
          callInterpreted2(i);
        }
    }
```

where the static stubs are laid out as follows:

```
  0x0000eee528201dd0:   ldr   x8, 0x0000eee528201dd8      ;   {trampoline_stub}
  0x0000eee528201dd4:   br    x8
  0x0000eee528201dd8:   .inst 0x2f69fe40 ; undefined
  0x0000eee528201ddc:   .inst 0x0000eee5 ; undefined
  0x0000eee528201de0:   ldr   x8, 0x0000eee528201de8      ;   {trampoline_stub}
  0x0000eee528201de4:   br    x8
  0x0000eee528201de8:   .inst 0x2f69fe40 ; undefined
  0x0000eee528201dec:   .inst 0x0000eee5 ; undefined
  0x0000eee528201df0:   ldr   x8, 0x0000eee528201df8      ;   {trampoline_stub}
  0x0000eee528201df4:   br    x8
  0x0000eee528201df8:   .inst 0x2f69fe40 ; undefined
  0x0000eee528201dfc:   .inst 0x0000eee5 ; undefined
  0x0000eee528201e00:   ldr   x12, 0x0000eee528201e0c     ;   {static_stub}
  0x0000eee528201e04:   ldr   x8, 0x0000eee528201e14
  0x0000eee528201e08:   br    x8
  0x0000eee528201e0c:   .inst 0x00000000 ; undefined
  0x0000eee528201e10:   .inst 0x00000000 ; undefined
  0x0000eee528201e14:   .inst 0x00000000 ; undefined
  0x0000eee528201e18:   .inst 0x00000000 ; undefined
  0x0000eee528201e1c:   ldr   x12, 0x0000eee528201e28     ;   {static_stub}
  0x0000eee528201e20:   ldr   x8, 0x0000eee528201e30
  0x0000eee528201e24:   br    x8
  0x0000eee528201e28:   .inst 0x00000000 ; undefined
  0x0000eee528201e2c:   .inst 0x00000000 ; undefined
  0x0000eee528201e30:   .inst 0x00000000 ; undefined
  0x0000eee528201e34:   .inst 0x00000000 ; undefined
  0x0000eee528201e38:   ldr   x12, 0x0000eee528201e44     ;   {static_stub}
  0x0000eee528201e3c:   ldr   x8, 0x0000eee528201e4c
  0x0000eee528201e40:   br    x8
  0x0000eee528201e44:   .inst 0x00000000 ; undefined
  0x0000eee528201e48:   .inst 0x00000000 ; undefined
  0x0000eee528201e4c:   .inst 0x00000000 ; undefined
  0x0000eee528201e50:   .inst 0x00000000 ; undefined
```

Here are the performance results:

Benchmark                                       (length)    Mode    Cnt   Master       Aligned     Unaligned   Units
StaticCallStub.StaticCallStubFar.callCompiled     1000      avgt    5     114.794      63.117      64.346      us/op
StaticCallStub.StaticCallStubFar.callCompiled     10000     avgt    5     1136.016     618.576     619.629     us/op
StaticCallStub.StaticCallStubFar.callCompiled     100000    avgt    5     11323.945    6191.452    6277.813    us/op
StaticCallStub.StaticCallStubNear.callCompiled    1000      avgt    5     114.335      63.142      64.091      us/op
StaticCallStub.StaticCallStubNear.callCompiled    10000     avgt    5     1140.667     618.653     619.861     us/op
StaticCallStub.StaticCallStubNear.callCompiled    100000    avgt    5     11351.394    6194.946    6195.255    us/op

We have several aspects to consider:

  • 8-byte alignment brings a minor performance gain, but it’s not significant compared to the overall improvement achieved by reimplementing the static stubs.

  • Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.

  • The alignment requirement for trampoline stubs doesn’t always introduce extra NOPs: padding is only added when trampoline and static stubs are interleaved. In contrast, enforcing 8-byte alignment for static stubs almost always introduces padding, since the first three instructions are not naturally aligned (see the sketch after this list).

  • We should carefully balance the trade-off between code size increase and code hotness (i.e., how frequently the stub code is executed).
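
For reference, a minimal sketch of what the aligned variant could look like (illustrative fragment in MacroAssembler context; the only change relative to the current stub shape is the `align` call before the data labels):

```
// Illustrative only: pad so the 64-bit data words are 8-byte aligned before
// binding the labels they are loaded from.
Label far_jump_metadata, far_jump_entry;
ldr(rmethod, far_jump_metadata);
ldr(rscratch1, far_jump_entry);
br(rscratch1);
align(wordSize);                 // typically costs one nop: three 4-byte
                                 // instructions leave the pc misaligned whenever
                                 // the stub itself starts 8-byte aligned
bind(far_jump_metadata);
emit_int64(0);                   // Method* word
bind(far_jump_entry);
emit_int64(0);                   // destination word
```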

Any feedback or suggestions would be greatly appreciated.

Member
> Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.

I thought other threads are modifying the data, which is the reason why we needed the ISB.

Contributor

> Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.
>
> I thought other threads are modifying the data, which is the reason why we needed the ISB.

Yes, exactly. Other threads are executing this nmethod while we are modifying its static call stub.

We hold the lock while we're doing this, so any thread racing with us to modify the call stub would be blocked. However, another thread could execute the call we just patched but not fully observe the changes we made to the stub, which is why we need the ISB.

We have a similar situation with this PR, but with data rather than instruction memory. We need to make sure that any racing thread observes changes to the stub before it observes the patched call to the stub. There is no ordering between instruction and data memory, which is why the current stub uses only instruction memory, keeping everything right. The ISB ensures that the executing thread observes the changed stub: there is a happens-before from the ICache::invalidate_range after patching the stub (but before patching the call) to the isb at the start of the stub.

For this PR to be correctly synchronized, we'd need a similar happens-before from patching the constants used in the stub to executing the stub. I can think of a way to do that, but it's fiddly and may not be worth it.
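
For readers less familiar with the memory-model vocabulary here, a rough analogy in plain C++ (not HotSpot code): the stub’s data words play the role of `payload` and the patched call plays the role of `published`. In the real case the “flag” is the patched call instruction itself, fetched on the instruction side, which is why an ordinary release/acquire pairing does not directly apply and the fix is fiddly.

```
#include <atomic>
#include <cstdint>

std::atomic<bool> published{false};  // stands in for "the call has been patched"
uint64_t payload = 0;                // stands in for the stub's data words

void writer(uint64_t new_target) {
  payload = new_target;                               // 1. patch the stub data
  published.store(true, std::memory_order_release);   // 2. then publish the call
}

uint64_t reader() {
  // Only with acquire (or an equivalent barrier/dependency) is a reader that
  // sees `published` also guaranteed to see the up-to-date `payload`.
  if (published.load(std::memory_order_acquire)) {
    return payload;
  }
  return 0;
}
```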

@adinn adinn (Contributor) commented Aug 6, 2025
Yes, it's a very subtle point that the isb serves to isolate threads executing the call from partial updates by the writing thread, i.e. it makes the change to the call site atomic. It might be useful to have that written in some comment in the patching code rather than just documented in this issue (but it helps to have it said here).

I understand that an isb is relatively expensive and I don't doubt the figures provided from running the micro-benchmark. However, they are not necessarily any indication that there is a problem here. How many such call sites are there and how often do they get patched? Is that significant relative to, say, the number of times they are executed or some other measure of overall cost? Is there evidence that we really need to fix this?

fg1417 (Author)

Thanks for reviewing, @theRealAph @adinn @dean-long.

Also thanks a lot for the explanation — it’s really helpful and clarified why the isb is necessary in the current implementation.

This patch is inspired by the trampoline stub design, and I’m trying to understand why that approach works as it does. Happy to be corrected if I’ve got anything wrong.

Initially, we generate a trampoline stub to redirect control flow from the caller to the static call resolver. Once the static stub is patched, we also update the trampoline stub, though in some cases the call may bypass the trampoline entirely and jump directly from the caller to the static stub. When the callee is eventually compiled, we reset the trampoline to once again direct the call back to the resolver. After resolution completes, we then update the trampoline to point to the compiled callee.

My question is: Why isn’t any explicit memory synchronization required before executing the trampoline stub? How do we ensure that any data loaded by the trampoline stub is fully synchronized across multiple threads? I’d appreciate any insights.

Thank you very much!

Contributor

> My question is: Why isn’t any explicit memory synchronization required before executing the trampoline stub? How do we ensure that any data loaded by the trampoline stub is fully synchronized across multiple threads?

It isn't. Other threads may call the static call resolver multiple times even after it's been patched. The difference is that for a trampoline call there is only one address to load so it can be patched atomically by a single store, and it doesn't matter if the address is stale.

We could read/write the { code; method } pair inside a critical section, thus guaranteeing that the operation is correctly synchronized. It might well be a bit faster, but it would be fiddly, especially on Apple silicon.
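
For illustration, a minimal sketch of the single-store point above (hypothetical helper; HotSpot's real accessors differ, and the offset is taken from the trampoline layout shown earlier in this thread):

```
// Hypothetical helper: the trampoline's one 8-byte target word sits right after
// its two instructions (ldr x8, <data>; br x8) and is 8-byte aligned, so
// repointing it is a single aligned 64-bit store (single-copy atomic on AArch64).
// A racing caller that still sees the old value just branches to a stale but
// valid target, so no isb is required.
static void patch_trampoline_target(address stub, address new_target) {
  address data = stub + 2 * NativeInstruction::instruction_size;  // skip ldr; br
  *(volatile uint64_t*)data = (uint64_t)new_target;
}
```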

@dean-long (Member)

In the past we would have scheduled the load for x8 earlier to avoid a latency hit for using the value in the next instruction:

```
ldr x8, Label_entry
ldr x12, Label_data
br x8
```

Is this still helpful on modern aarch64 hardware?

@theRealAph (Contributor)

> In the past we would have scheduled the load for x8 earlier to avoid a latency hit for using the value in the next instruction:
>
> ```
> ldr x8, Label_entry
> ldr x12, Label_data
> br x8
> ```
>
> Is this still helpful on modern aarch64 hardware?

Maybe on something very small, such as the Cortex-A34. It doesn't hurt larger out-of-order cores, so I agree with you that we should reorder this.
