
8363620: AArch64: reimplement emit_static_call_stub() #26638


Open · fg1417 wants to merge 1 commit into master

Conversation


@fg1417 fg1417 commented Aug 5, 2025

In the existing implementation, the static call stub typically emits a sequence like:
`isb; movk; movz; movz; movk; movz; movz; br`.

This patch reimplements it using a more compact and patch-friendly sequence:

```
ldr x12, Label_data
ldr x8, Label_entry
br x8
Label_data:
  0x00000000
  0x00000000
Label_entry:
  0x00000000
  0x00000000
```

The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an isb[1], which are both relatively expensive operations.
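
For illustration, a minimal sketch of how such a stub could be emitted with the AArch64 MacroAssembler; the register choices mirror the layout above (x12 is `rmethod`, x8 is `rscratch1`), but the function name and placeholder values are illustrative, not the exact patch.

```
// Illustrative sketch only (assumes MacroAssembler member context); the real
// change may differ in details such as relocation handling and size checks.
void MacroAssembler::emit_static_call_stub_sketch() {
  Label data, entry;
  ldr(rmethod, data);       // x12 <- Method*, loaded from the data word below
  ldr(rscratch1, entry);    // x8  <- call target
  br(rscratch1);            // jump to the resolved target
  bind(data);
  emit_int64(0);            // Method* slot, patched at resolution time
  bind(entry);
  emit_int64(0);            // destination slot, patched at resolution time
}
```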

While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an isb. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
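
On the patching side, retargeting then amounts to two 64-bit data stores at fixed offsets from the stub start, with no instruction bytes touched. A rough sketch with a hypothetical helper (offsets follow the layout above; the ordering and atomicity of these stores are discussed in the review thread below):

```
// Hypothetical helper, not the actual HotSpot accessor: update the two data
// words behind the ldr/ldr/br sequence; no instructions change, so no
// I-cache flush or isb is needed for the data itself.
static void patch_static_stub(address stub, Method* callee, address dest) {
  const int data_offset  = 3 * NativeInstruction::instruction_size; // after ldr; ldr; br
  const int entry_offset = data_offset + wordSize;
  *(uint64_t*)(stub + data_offset)  = (uint64_t)callee;   // Method* word
  *(uint64_t*)(stub + entry_offset) = (uint64_t)dest;     // branch-target word
}
```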

A microbenchmark (StaticCallStub.java) shows the time per operation dropping by approximately 43%.

Benchmark                       (length)   Mode   Cnt Master     Patch      Units
StaticCallStubFar.callCompiled    1000     avgt   5   39.346     22.474     us/op
StaticCallStubFar.callCompiled    10000    avgt   5   390.05     218.478    us/op
StaticCallStubFar.callCompiled    100000   avgt   5   3869.264   2174.001   us/op
StaticCallStubNear.callCompiled   1000     avgt   5   39.093     22.582     us/op
StaticCallStubNear.callCompiled   10000    avgt   5   387.319    217.398    us/op
StaticCallStubNear.callCompiled   100000   avgt   5   3855.825   2206.923   us/op

All tests in Tier1 to Tier3, under both release and debug builds, have passed.

[1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8363620: AArch64: reimplement emit_static_call_stub() (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26638/head:pull/26638
$ git checkout pull/26638

Update a local copy of the PR:
$ git checkout pull/26638
$ git pull https://git.openjdk.org/jdk.git pull/26638/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26638

View PR using the GUI difftool:
$ git pr show -t 26638

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26638.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Aug 5, 2025

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 5, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

The @openjdk bot added the `rfr` (Pull request is ready for review) label on Aug 5, 2025.
@openjdk

openjdk bot commented Aug 5, 2025

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Aug 5, 2025

Webrevs

Review thread on the following lines of the patch:

```
ldr(rmethod, far_jump_metadata);
ldr(rscratch1, far_jump_entry);
br(rscratch1);
bind(far_jump_metadata);
```
fg1417 (Author)
I’m considering whether the data here should be 8-byte aligned, similar to what we did for the trampoline stubs.

I tried a small test case:

```
    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    public static void callInterpreted0(int i) {
        val0 = i;
    }

    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    public static void callInterpreted1(int i) {
        val1 = i;
    }

    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    public static void callInterpreted2(int i) {
        val2 = i;
    }

    @Benchmark
    public void callCompiled() {
        for (int i = 0; i < length; i++) {
          callInterpreted0(i); // Make sure this is excluded from compilation
          callInterpreted1(i);
          callInterpreted2(i);
        }
    }
```

where the static stubs are laid out as follows:

```
  0x0000eee528201dd0:   ldr   x8, 0x0000eee528201dd8      ;   {trampoline_stub}
  0x0000eee528201dd4:   br    x8
  0x0000eee528201dd8:   .inst 0x2f69fe40 ; undefined
  0x0000eee528201ddc:   .inst 0x0000eee5 ; undefined
  0x0000eee528201de0:   ldr   x8, 0x0000eee528201de8      ;   {trampoline_stub}
  0x0000eee528201de4:   br    x8
  0x0000eee528201de8:   .inst 0x2f69fe40 ; undefined
  0x0000eee528201dec:   .inst 0x0000eee5 ; undefined
  0x0000eee528201df0:   ldr   x8, 0x0000eee528201df8      ;   {trampoline_stub}
  0x0000eee528201df4:   br    x8
  0x0000eee528201df8:   .inst 0x2f69fe40 ; undefined
  0x0000eee528201dfc:   .inst 0x0000eee5 ; undefined
  0x0000eee528201e00:   ldr   x12, 0x0000eee528201e0c     ;   {static_stub}
  0x0000eee528201e04:   ldr   x8, 0x0000eee528201e14
  0x0000eee528201e08:   br    x8
  0x0000eee528201e0c:   .inst 0x00000000 ; undefined
  0x0000eee528201e10:   .inst 0x00000000 ; undefined
  0x0000eee528201e14:   .inst 0x00000000 ; undefined
  0x0000eee528201e18:   .inst 0x00000000 ; undefined
  0x0000eee528201e1c:   ldr   x12, 0x0000eee528201e28     ;   {static_stub}
  0x0000eee528201e20:   ldr   x8, 0x0000eee528201e30
  0x0000eee528201e24:   br    x8
  0x0000eee528201e28:   .inst 0x00000000 ; undefined
  0x0000eee528201e2c:   .inst 0x00000000 ; undefined
  0x0000eee528201e30:   .inst 0x00000000 ; undefined
  0x0000eee528201e34:   .inst 0x00000000 ; undefined
  0x0000eee528201e38:   ldr   x12, 0x0000eee528201e44     ;   {static_stub}
  0x0000eee528201e3c:   ldr   x8, 0x0000eee528201e4c
  0x0000eee528201e40:   br    x8
  0x0000eee528201e44:   .inst 0x00000000 ; undefined
  0x0000eee528201e48:   .inst 0x00000000 ; undefined
  0x0000eee528201e4c:   .inst 0x00000000 ; undefined
  0x0000eee528201e50:   .inst 0x00000000 ; undefined
```

Here are the performance results:

Benchmark                                       (length)    Mode    Cnt   Master       Aligned     Unaligned   Units
StaticCallStub.StaticCallStubFar.callCompiled     1000      avgt    5     114.794      63.117      64.346      us/op
StaticCallStub.StaticCallStubFar.callCompiled     10000     avgt    5     1136.016     618.576     619.629     us/op
StaticCallStub.StaticCallStubFar.callCompiled     100000    avgt    5     11323.945    6191.452    6277.813    us/op
StaticCallStub.StaticCallStubNear.callCompiled    1000      avgt    5     114.335      63.142      64.091      us/op
StaticCallStub.StaticCallStubNear.callCompiled    10000     avgt    5     1140.667     618.653     619.861     us/op
StaticCallStub.StaticCallStubNear.callCompiled    100000    avgt    5     11351.394    6194.946    6195.255    us/op

We have several aspects to consider:

  • 8-byte alignment brings a minor performance gain, but it’s not significant compared to the overall improvement achieved by reimplementing the static stubs.

  • Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.

  • The alignment requirement for trampoline stubs doesn’t always introduce extra NOPs: padding is only added when trampoline and static stubs are interleaved. In contrast, enforcing 8-byte alignment for static stubs almost always introduces padding, since the first three instructions are not naturally aligned (see the sketch after this list).

  • We should carefully balance the trade-off between code size increase and code hotness (i.e., how frequently the stub code is executed).
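
For reference, a minimal sketch of what the aligned variant could look like (illustrative fragment in MacroAssembler context; the only change relative to the current stub shape is the `align` call before the data labels):

```
// Illustrative only: pad so the 64-bit data words are 8-byte aligned before
// binding the labels they are loaded from.
Label far_jump_metadata, far_jump_entry;
ldr(rmethod, far_jump_metadata);
ldr(rscratch1, far_jump_entry);
br(rscratch1);
align(wordSize);                 // typically costs one nop: three 4-byte
                                 // instructions leave the pc misaligned whenever
                                 // the stub itself starts 8-byte aligned
bind(far_jump_metadata);
emit_int64(0);                   // Method* word
bind(far_jump_entry);
emit_int64(0);                   // destination word
```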

Any feedback or suggestions would be greatly appreciated.

Member
> Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.

I thought other threads are modifying the data, which is the reason why we needed the ISB.

Contributor

> Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.
>
> I thought other threads are modifying the data, which is the reason why we needed the ISB.

Yes, exactly. Other threads are executing this nmethod while we are modifying its static call stub.

We hold the lock while we're doing this, so any thread racing with us to modify the call stub would be blocked. However, another thread could execute the call we just patched but not fully observe the changes we made to the stub, which is why we need the ISB.

We have a similar situation with this PR, but with data rather than instruction memory. We need to make sure that any racing thread observes changes to the stub before it observes the patched call to the stub. There is no ordering between instruction and data memory, which is why the current stub uses only instruction memory, keeping everything right. The ISB ensures that the executing thread observes the changed stub: there is a happens-before from the ICache::invalidate_range after patching the stub (but before patching the call) to the isb at the start of the stub.

For this PR to be correctly synchronized, we'd need a similar happens-before from patching the constants used in the stub to executing the stub. I can think of a way to do that, but it's fiddly and may not be worth it.
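
For readers less familiar with the memory-model vocabulary here, a rough analogy in plain C++ (not HotSpot code): the stub’s data words play the role of `payload` and the patched call plays the role of `published`. In the real case the “flag” is the patched call instruction itself, fetched on the instruction side, which is why an ordinary release/acquire pairing does not directly apply and the fix is fiddly.

```
#include <atomic>
#include <cstdint>

std::atomic<bool> published{false};  // stands in for "the call has been patched"
uint64_t payload = 0;                // stands in for the stub's data words

void writer(uint64_t new_target) {
  payload = new_target;                               // 1. patch the stub data
  published.store(true, std::memory_order_release);   // 2. then publish the call
}

uint64_t reader() {
  // Only with acquire (or an equivalent barrier/dependency) is a reader that
  // sees `published` also guaranteed to see the up-to-date `payload`.
  if (published.load(std::memory_order_acquire)) {
    return payload;
  }
  return 0;
}
```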

@adinn adinn (Contributor) commented Aug 6, 2025
Yes, it's a very subtle point that the isb serves to isolate threads executing the call from partial updates by the writing thread, i.e. it makes the change to the call site atomic. It might be useful to have that written in some comment in the patching code rather than just documented in this issue (but it helps to have it said here).

I understand that an isb is relatively expensive and I don't doubt the figures provided from running the micro-benchmark. However, they are not necessarily any indication that there is a problem here. How many such call sites are there and how often do they get patched? Is that significant relative to, say, the number of times they are executed or some other measure of overall cost? Is there evidence that we really need to fix this?

fg1417 (Author)

Thanks for reviewing, @theRealAph @adinn @dean-long.

Also thanks a lot for the explanation — it’s really helpful and clarified why the isb is necessary in the current implementation.

This patch is inspired by the trampoline stub design, and I’m trying to understand why that approach works as it does. Happy to be corrected if I’ve got anything wrong.

Initially, we generate a trampoline stub to redirect control flow from the caller to the static call resolver. Once the static stub is patched, we also update the trampoline stub, though in some cases the call may bypass the trampoline entirely and jump directly from the caller to the static stub. When the callee is eventually compiled, we reset the trampoline to once again direct the call back to the resolver. After resolution completes, we then update the trampoline to point to the compiled callee.

My question is: Why isn’t any explicit memory synchronization required before executing the trampoline stub? How do we ensure that any data loaded by the trampoline stub is fully synchronized across multiple threads? I’d appreciate any insights.

Thank you very much!

Contributor

> My question is: Why isn’t any explicit memory synchronization required before executing the trampoline stub? How do we ensure that any data loaded by the trampoline stub is fully synchronized across multiple threads?

It isn't. Other threads may call the static call resolver multiple times even after it's been patched. The difference is that for a trampoline call there is only one address to load so it can be patched atomically by a single store, and it doesn't matter if the address is stale.

We could read/write the { code; method } pair inside a critical section, thus guaranteeing that the operation is correctly synchronized. It might well be a bit faster, but it would be fiddly, especially on Apple silicon.
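
For illustration, a minimal sketch of the single-store point above (hypothetical helper; HotSpot's real accessors differ, and the offset is taken from the trampoline layout shown earlier in this thread):

```
// Hypothetical helper: the trampoline's one 8-byte target word sits right after
// its two instructions (ldr x8, <data>; br x8) and is 8-byte aligned, so
// repointing it is a single aligned 64-bit store (single-copy atomic on AArch64).
// A racing caller that still sees the old value just branches to a stale but
// valid target, so no isb is required.
static void patch_trampoline_target(address stub, address new_target) {
  address data = stub + 2 * NativeInstruction::instruction_size;  // skip ldr; br
  *(volatile uint64_t*)data = (uint64_t)new_target;
}
```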

@dean-long (Member)

In the past we would have scheduled the load for x8 earlier to avoid a latency hit for using the value in the next instruction:

```
ldr x8, Label_entry
ldr x12, Label_data
br x8
```

Is this still helpful on modern aarch64 hardware?

@theRealAph (Contributor)

> In the past we would have scheduled the load for x8 earlier to avoid a latency hit for using the value in the next instruction:
>
> ```
> ldr x8, Label_entry
> ldr x12, Label_data
> br x8
> ```
>
> Is this still helpful on modern aarch64 hardware?

Maybe on something very small, such as the Cortex-A34. It doesn't hurt larger out-of-order cores, so I agree with you that we should reorder this.
