Description
I've been working on prototypes to optimize interface dispatches in Graal. I'm planning to start working on pull requests but would like to get your feedback on the overall approach before proceeding.
Problem
Scala codebases usually define computations using composable APIs that build a graph to be later executed by an interpreter. These APIs include abstractions like Scala's and Twitter's Futures for asynchronous computations, Stitch and Clump for batching of remote asynchronous calls, Monix's Task and ZIO's IO for defining pure computations, and many other monad-like APIs.
These abstractions introduce an execution behavior usually not present in traditional Java applications. Given that the definition of the computation graph involves several method invocations, the interpreter typically doesn't get inlined with the definition of the computation, which produces a high number of invokeinterface calls. At interpretation time, there isn't relevant profiling information to determine the concrete implementations of the transformations, so they don't get optimized.
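To make the pattern concrete, here's a minimal Scala sketch of the kind of code I'm describing (fetchUser and fetchScore are made-up placeholders): the graph is assembled out of small Function1 closures, and the Future machinery later invokes each of them through the Function1 interface.

```scala
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

object CompositionSketch {
  // Hypothetical remote calls, just to give the graph some shape.
  def fetchUser(id: Long): Future[String] = Future(s"user-$id")
  def fetchScore(user: String): Future[Int] = Future(user.length)

  // Each map/flatMap/recover registers a separate Function1 (or PartialFunction)
  // that the Future implementation later calls through its interface type, so the
  // call sites inside the Future machinery become megamorphic interface calls.
  def score(id: Long): Future[Int] =
    fetchUser(id)
      .flatMap(fetchScore)
      .map(_ * 2)
      .recover { case _: Exception => 0 }
}
```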
For instance, this is a screenshot of a VTune profiling session of an application that uses Stitch:
The itable stub is the mechanism that HotSpot uses to dispatch invokeinterface calls and, as we can observe, most of the CPU usage goes into executing these stubs.
Better inlining might be able to reduce the number of invokeinterface calls but, given the nature of these codebases, they'll always produce a high number of interface dispatches.
Solution
I've been exploring alternatives to optimize interface dispatches in Graal. The idea is to have a set of strategies that leverage profiling information and common patterns to inline invokeinterface stubs. Graal would evaluate these strategies and choose the cheapest one.
Strategies
I've identified a few strategies so far:
Common offset
If the observed implementations of an interface resolve to the same method offset, it's possible to inline the stub at the common offset and introduce a guard to deoptimize if a new implementation is observed.
Common superclass
If the observed implementations extend a common superclass that contains the method, it's possible to dispatch through the superclass method and introduce a guard to deoptimize if the object doesn't extend the superclass. For instance, all Function1 instances generated by the Scala compiler have an AbstractFunction1 superclass. The same happens for a number of interfaces in Scala collections.
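As a rough source-level illustration of the guard (the real transformation would be applied by the compiler to the optimized code; the helper below is only a sketch): when the receiver extends the common superclass, the call goes through the class's vtable instead of the itable.

```scala
import scala.runtime.AbstractFunction1

object SuperclassDispatchSketch {
  // If the receiver extends the common superclass, call apply through it
  // (a virtual call via the vtable); otherwise fall back to the plain interface
  // call, which is where the compiled code would deoptimize instead.
  def applyGuarded(f: Int => Int, x: Int): Int = f match {
    case af: AbstractFunction1[Int, Int] @unchecked => af(x)    // virtual dispatch
    case other                                      => other(x) // invokeinterface fallback
  }
}
```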
Single method
If the observed implementations have a single method (besides the ones from Object), it's possible to dispatch directly at the offset and introduce a guard to deoptimize if the itable is larger than expected.
Caching
At JIT compilation time, it's possible to determine the offsets of each observed implementation and create a TypeSwitch that does the dispatch directly at the specific offsets. This is a simple strategy but might be cheaper in some cases, especially since HotSpot doesn't share itable stubs on most architectures and thus doesn't leverage branch prediction well.
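At the source level, this is essentially a class-based switch over the implementations seen in the profile; the sketch below uses hypothetical UnaryOp, Succ and Pred types to show the shape of the guarded dispatch.

```scala
object CachingDispatchSketch {
  trait UnaryOp { def apply(x: Int): Int }
  final class Succ extends UnaryOp { def apply(x: Int): Int = x + 1 }
  final class Pred extends UnaryOp { def apply(x: Int): Int = x - 1 }

  // Suppose the profile only observed Succ and Pred at this call site. The compiled
  // code checks the receiver's class and jumps straight to the concrete method,
  // falling back (deoptimizing, in the real implementation) for anything else.
  def applyCached(op: UnaryOp, x: Int): Int = op match {
    case s: Succ => s.apply(x)     // direct, inlineable call guarded by the class check
    case p: Pred => p.apply(x)     // direct call
    case other   => other.apply(x) // unexpected implementation: regular itable dispatch
  }
}
```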
Better TypeSwitch
Graal currently compiles TypeSwitch nodes using linear search (the sequential switch strategy). I've been exploring an approach to compile TypeSwitches based on the addresses of the classes, which would allow Graal to evaluate other switch strategies and choose the cheapest one. Given that some of the invokeinterface strategies would use TypeSwitch nodes, optimizing TypeSwitch is important to make those strategies effective.
Additionally, I've been working on a new switch compilation strategy based on hashing that would work well for sparse keys (which is the case for the class addresses).
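To illustrate the hashing idea (this is only a sketch, not the actual Graal implementation): search for a multiplier that maps the sparse keys collision-free into a small power-of-two table, then dispatch with a multiply, a shift, a load, and a single comparison against the expected key. The key values below are made up.

```scala
object HashSwitchSketch {
  /** Search for a multiplier that maps `keys` into 2^bits slots without collisions. */
  def findMultiplier(keys: Array[Long], bits: Int): Option[Long] = {
    val rnd = new scala.util.Random(42)
    def slot(key: Long, m: Long): Int = ((key * m) >>> (64 - bits)).toInt
    Iterator.continually(rnd.nextLong() | 1L) // odd multipliers
      .take(10000)
      .find(m => keys.map(slot(_, m)).distinct.length == keys.length)
  }

  def main(args: Array[String]): Unit = {
    // Sparse, address-like keys (made-up values standing in for class addresses).
    val keys = Array(0x7f3a00001000L, 0x7f3a00013800L, 0x7f3a000a4c00L)
    val bits = 3 // 8-slot table
    val m = findMultiplier(keys, bits).getOrElse(sys.error("no multiplier found"))
    def slot(key: Long): Int = ((key * m) >>> (64 - bits)).toInt

    // Each slot stores the key it expects plus the index of the switch target.
    val slotKey = Array.fill(1 << bits)(-1L)
    val slotTarget = Array.fill(1 << bits)(-1)
    for ((k, i) <- keys.zipWithIndex) { slotKey(slot(k)) = k; slotTarget(slot(k)) = i }

    // Dispatch: one multiply, one shift, one load, one compare.
    def dispatch(key: Long): Int = {
      val s = slot(key)
      if (slotKey(s) == key) slotTarget(s) else -1 // -1 = default branch (deoptimize)
    }
    keys.foreach(k => println(f"key 0x$k%x -> target ${dispatch(k)}"))
  }
}
```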
Development phases
If you're ok with the overall approach, I'm planning on submitting multiple pull requests:
- Introduce the new hashing strategy to compile switches
- Implement a better heuristic to decide between the size of the switch tables and the cost of the strategies
- Make the switch strategies accept long values (they currently work only for integers) and compile TypeSwitch using class addresses so it can leverage the different switch strategies
- Introduce the mechanism to choose between different invokeinterface strategies, initially with two simple strategies: Caching and Fallback (which falls back to the itable stubs)
- Create individual pull requests for each strategy