8363620: AArch64: reimplement emit_static_call_stub() #26638
Conversation
In the existing implementation, the static call stub typically emits a sequence like `isb; movk; movz; movz; movk; movz; movz; br`. This patch reimplements it using a more compact and patch-friendly sequence:

```
ldr x12, Label_data
ldr x8, Label_entry
br x8
Label_data:
0x00000000
0x00000000
Label_entry:
0x00000000
0x00000000
```

The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. That avoids I-cache flushes and `isb` instructions [1], both of which are relatively expensive operations.

While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an `isb`. This patch therefore unifies code generation by emitting the same static stubs for both small and large code caches.

A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%:

Benchmark                        (length)  Mode  Cnt    Master     Patch  Units
StaticCallStubFar.callCompiled       1000  avgt    5    39.346    22.474  us/op
StaticCallStubFar.callCompiled      10000  avgt    5    390.05   218.478  us/op
StaticCallStubFar.callCompiled     100000  avgt    5  3869.264  2174.001  us/op
StaticCallStubNear.callCompiled      1000  avgt    5    39.093    22.582  us/op
StaticCallStubNear.callCompiled     10000  avgt    5   387.319   217.398  us/op
StaticCallStubNear.callCompiled    100000  avgt    5  3855.825  2206.923  us/op

All tests in tiers 1 through 3, under both release and debug builds, have passed.

[1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
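To illustrate the "data-only update" the description relies on, here is a minimal sketch, assuming the layout above with naturally aligned 64-bit literal words. The function and variable names are illustrative only, not the PR's actual patching API:

```cpp
#include <cstdint>

// Sketch: repoint a stub of the new layout by rewriting only its two
// literal words. No instruction bytes change, so no I-cache flush and no
// isb is required at the patch site (modulo the cross-thread ordering
// question discussed later in this thread).
void patch_static_stub(uint64_t* stub_data, uint64_t new_metadata, uint64_t new_entry) {
  stub_data[0] = new_metadata; // Label_data: the value loaded into x12 (rmethod)
  stub_data[1] = new_entry;    // Label_entry: the address loaded into x8, then branched to
}
```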
👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.

❗ This change is not yet ready to be integrated.
ldr(rmethod, far_jump_metadata);
ldr(rscratch1, far_jump_entry);
br(rscratch1);
bind(far_jump_metadata);
I’m considering whether the data here should be 8-byte aligned, similar to what we did for the trampoline stubs:

align(wordSize);
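If adopted, the suggestion would presumably slot in just before the data labels. A sketch, where the `emit_int64` continuation is my assumption about how the hunk above ends:

```cpp
ldr(rmethod, far_jump_metadata);
ldr(rscratch1, far_jump_entry);
br(rscratch1);
align(wordSize); // three 4-byte instructions leave the data 4 bytes short of
                 // 8-byte alignment, so this emits one nop whenever the stub
                 // itself starts 8-byte aligned
bind(far_jump_metadata);
emit_int64(0);   // patched later with the Method*
bind(far_jump_entry);
emit_int64(0);   // patched later with the branch target
```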
I tried with a small case:
// Fields assumed from the full benchmark: the stores' targets and the loop bound.
static int val0, val1, val2;

@Param({"1000", "10000", "100000"})
public int length;

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted0(int i) {
    val0 = i;
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted1(int i) {
    val1 = i;
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted2(int i) {
    val2 = i;
}

@Benchmark
public void callCompiled() {
    for (int i = 0; i < length; i++) {
        callInterpreted0(i); // Make sure this is excluded from compilation
        callInterpreted1(i);
        callInterpreted2(i);
    }
}
where the static stubs are laid out as follows:
0x0000eee528201dd0: ldr x8, 0x0000eee528201dd8 ; {trampoline_stub}
0x0000eee528201dd4: br x8
0x0000eee528201dd8: .inst 0x2f69fe40 ; undefined
0x0000eee528201ddc: .inst 0x0000eee5 ; undefined
0x0000eee528201de0: ldr x8, 0x0000eee528201de8 ; {trampoline_stub}
0x0000eee528201de4: br x8
0x0000eee528201de8: .inst 0x2f69fe40 ; undefined
0x0000eee528201dec: .inst 0x0000eee5 ; undefined
0x0000eee528201df0: ldr x8, 0x0000eee528201df8 ; {trampoline_stub}
0x0000eee528201df4: br x8
0x0000eee528201df8: .inst 0x2f69fe40 ; undefined
0x0000eee528201dfc: .inst 0x0000eee5 ; undefined
0x0000eee528201e00: ldr x12, 0x0000eee528201e0c ; {static_stub}
0x0000eee528201e04: ldr x8, 0x0000eee528201e14
0x0000eee528201e08: br x8
0x0000eee528201e0c: .inst 0x00000000 ; undefined
0x0000eee528201e10: .inst 0x00000000 ; undefined
0x0000eee528201e14: .inst 0x00000000 ; undefined
0x0000eee528201e18: .inst 0x00000000 ; undefined
0x0000eee528201e1c: ldr x12, 0x0000eee528201e28 ; {static_stub}
0x0000eee528201e20: ldr x8, 0x0000eee528201e30
0x0000eee528201e24: br x8
0x0000eee528201e28: .inst 0x00000000 ; undefined
0x0000eee528201e2c: .inst 0x00000000 ; undefined
0x0000eee528201e30: .inst 0x00000000 ; undefined
0x0000eee528201e34: .inst 0x00000000 ; undefined
0x0000eee528201e38: ldr x12, 0x0000eee528201e44 ; {static_stub}
0x0000eee528201e3c: ldr x8, 0x0000eee528201e4c
0x0000eee528201e40: br x8
0x0000eee528201e44: .inst 0x00000000 ; undefined
0x0000eee528201e48: .inst 0x00000000 ; undefined
0x0000eee528201e4c: .inst 0x00000000 ; undefined
0x0000eee528201e50: .inst 0x00000000 ; undefined
Here are the performance results:
Benchmark                                       (length)  Mode  Cnt     Master    Aligned  Unaligned  Units
StaticCallStub.StaticCallStubFar.callCompiled       1000  avgt    5    114.794     63.117     64.346  us/op
StaticCallStub.StaticCallStubFar.callCompiled      10000  avgt    5   1136.016    618.576    619.629  us/op
StaticCallStub.StaticCallStubFar.callCompiled     100000  avgt    5  11323.945   6191.452   6277.813  us/op
StaticCallStub.StaticCallStubNear.callCompiled      1000  avgt    5    114.335     63.142     64.091  us/op
StaticCallStub.StaticCallStubNear.callCompiled     10000  avgt    5   1140.667    618.653    619.861  us/op
StaticCallStub.StaticCallStubNear.callCompiled    100000  avgt    5  11351.394   6194.946   6195.255  us/op
We have several aspects to consider:
- 8-byte alignment brings a minor performance gain, but it's not significant compared to the overall improvement achieved by reimplementing the static stubs.
- Unaligned memory access may be non-atomic, although in this case, other threads aren't modifying the data.
- The alignment requirement for trampoline stubs doesn't always introduce extra NOPs: padding is only added when trampoline and static stubs are interleaved. In contrast, enforcing 8-byte alignment for static stubs almost always introduces padding, since the three instructions before the data occupy 12 bytes, leaving the 8-byte words misaligned whenever the stub itself starts aligned.
- We should carefully balance the trade-off between the code-size increase and the code's hotness (i.e., how frequently the stub code is executed).
Any feedback or suggestions would be greatly appreciated.
> Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.
I thought other threads are modifying the data, which is the reason why we needed the ISB.
> Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.

> I thought other threads are modifying the data, which is the reason why we needed the ISB.
Yes, exactly. Other threads are executing this nmethod while we are modifying its static call stub.
We hold the lock while we're doing this, so any thread racing with us to modify the call stub would be blocked. However, another thread could execute the call we just patched but not fully observe the changes we made to the stub, which is why we need the ISB.
We have a similar situation with this PR, but with data rather than instruction memory. We need to make sure that any racing thread observes changes to the stub before it observes the patched call to the stub. There is no ordering between instruction and data memory, which is why the current stub uses only instruction memory, keeping everything correct. The `isb` ensures that the executing thread observes the changed stub: there is a happens-before from the `ICache::invalidate_range` after patching the stub (but before patching the call) to the `isb` at the start of the stub.
For this PR to be correctly synchronized, we'd need a similar happens-before from patching the constants used in the stub to executing the stub. I can think of a way to do that, but it's fiddly and may not be worth it.
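To make the missing happens-before concrete, here is a standalone sketch with std::atomic standing in for HotSpot's patching machinery. The release/acquire pairing shows the ordering a correct scheme would need; all names are illustrative, not a proposed implementation:

```cpp
#include <atomic>
#include <cstdint>

// stub_data plays the role of the stub's two literal words; call_patched
// plays the role of the repatched call instruction becoming visible.
std::atomic<uint64_t> stub_data[2];
std::atomic<bool> call_patched{false};

void writer(uint64_t metadata, uint64_t entry) {
    stub_data[0].store(metadata, std::memory_order_relaxed);
    stub_data[1].store(entry, std::memory_order_relaxed);
    // Release: the data words must be visible before the patched call is.
    call_patched.store(true, std::memory_order_release);
}

void reader() {
    // This acquire pairs with the release above. A thread that merely
    // executes the patched call has no such acquire between instruction
    // fetch and the stub's data loads; that gap is what the isb closes in
    // the current, instruction-only scheme.
    if (call_patched.load(std::memory_order_acquire)) {
        uint64_t m = stub_data[0].load(std::memory_order_relaxed);
        uint64_t e = stub_data[1].load(std::memory_order_relaxed);
        (void)m; (void)e; // ... branch to e with m in rmethod ...
    }
}
```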
Yes, it's a very subtle point that the `isb` serves to isolate threads executing the call from partial updates by the writing thread, i.e., to make the change to the call site atomic. It might be useful to have that written in a comment in the patching code rather than just documented in this issue (but it helps to have it said here).

I understand that an `isb` is relatively expensive, and I don't doubt the figures provided from running the micro-benchmark. However, they are not necessarily any indication that there is a problem here. How many such call sites are there, and how often do they get patched? Is that significant relative to, say, the number of times they are executed, or some other measure of overall cost? Is there evidence that we really need to fix this?
Thanks for your reviews, @theRealAph @adinn @dean-long.
Also, thanks a lot for the explanation; it's really helpful and clarified why the `isb` is necessary in the current implementation.
This patch is inspired by the trampoline stub design, and I’m trying to understand why that approach works as it does. Happy to be corrected if I’ve got anything wrong.
Initially, we generate a trampoline stub to redirect control flow from the caller to the static call resolver. Once the static stub is patched, we also update the trampoline stub, though in some cases the call may bypass the trampoline entirely and jump directly from the caller to the static stub. When the callee is eventually compiled, we reset the trampoline to once again direct the call back to the resolver. After resolution completes, we then update the trampoline to point to the compiled callee.
My question is: Why isn’t any explicit memory synchronization required before executing the trampoline stub? How do we ensure that any data loaded by the trampoline stub is fully synchronized across multiple threads? I’d appreciate any insights.
Thank you very much!
> My question is: Why isn’t any explicit memory synchronization required before executing the trampoline stub? How do we ensure that any data loaded by the trampoline stub is fully synchronized across multiple threads?
It isn't. Other threads may call the static call resolver multiple times even after it's been patched. The difference is that for a trampoline call there is only one address to load so it can be patched atomically by a single store, and it doesn't matter if the address is stale.
We could read/write the { code; method } pair inside a critical section, thus guaranteeing that the operation is correctly synchronized. It might well be a bit faster, but it would be fiddly, especially on Apple silicon.
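For contrast, here is a standalone sketch of the single-word trampoline patch described above, with std::atomic standing in for a naturally aligned 64-bit literal in the code cache (all names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// The trampoline has exactly one literal word to update, so one aligned
// 64-bit store suffices: a racing reader sees either the old target or the
// new one, never a torn value, and a stale target is harmless because it
// still points somewhere safe (e.g. the resolver).
std::atomic<uint64_t> trampoline_target{0};

void patch_trampoline(uint64_t new_target) {
    trampoline_target.store(new_target, std::memory_order_relaxed);
}
```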
In the past we would have scheduled the load for x8 earlier to avoid a latency hit for using the value in the next instruction:
Is this still helpful on modern aarch64 hardware?

Maybe on something very small, such as the Cortex-A34. It doesn't hurt larger out-of-order cores, so I agree with you that we should reorder this.
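Concretely, the agreed reordering would presumably look like this (a sketch reusing the labels from the hunk earlier in this review; only the two loads swap places):

```cpp
ldr(rscratch1, far_jump_entry);  // issue the branch-target load first
ldr(rmethod, far_jump_metadata); // rmethod isn't consumed until the callee uses it
br(rscratch1);                   // so the br is less likely to stall on rscratch1
```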
Issue
JDK-8363620: AArch64: reimplement emit_static_call_stub() (https://bugs.openjdk.org/browse/JDK-8363620)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26638/head:pull/26638
$ git checkout pull/26638
Update a local copy of the PR:
$ git checkout pull/26638
$ git pull https://git.openjdk.org/jdk.git pull/26638/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26638
View PR using the GUI difftool:
$ git pr show -t 26638
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26638.diff