Skip to content

[Proposal] Invoke interface optimizations #893

Open
@fwbrasil

Description

@fwbrasil

I've been working on prototypes to optimize interface dispatches in Graal. I'm planning to start working on pull requests but would like to get your feedback on the overall approach before proceeding.

Problem

Scala codebases usually define computations using composable APIs that build a graph to be later executed by an interpreter. These APIs include abstractions like Scala's and Twitter's Futures for asynchronous computations, Stitch and Clump for batching of remote asynchronous calls, Monix's Task and ZIO's IO for defining pure computations, and many other monad-like APIs.

These abstractions introduce an execution behavior usually not present in traditional Java applications. Given that the definition of the computation graph involves several method invocations, the interpreter typically doesn't get inlined with the definition of the computation, which produces a high number of invokeinterface calls. At interpretation time, there isn't relevant profiling information to determine the concrete implementations of the transformations, so they don't get optimized.

For instance, this is a screenshot of a VTune profiling session of an application that uses Stitch:

The itable stub is the mechanism that Hotspot uses to dispatch invokeinterface calls, and, as we can observe, most of the CPU usage goes into executing these stubs.

Better inlining might be able to reduce the number of invokeinterface calls but, given the nature of these codebases, they'll always produce a high number of interface dispatches.

Solution

I've been exploring alternatives to optimize interface dispatches in Graal. The idea is to have a set of strategies that leverage profiling information and common patterns to inline invokeinterface stubs. Graal would evaluate these strategies and choose the cheapest one.

Strategies

I've identified a few strategies so far:

Common offset

If the observed implementations of an interface resolve to the same method offset, it's possible to inline the stub at the common offset and introduce a guard to deoptimize if a new implementation is observed.

Common superclass

If the observed implementations extend a common super class that contains the method, it's possible to dispatch through the super class method and introduce a guard to deoptimize if the object doesn't extend the super class. For instance, all Function1 generated by the Scala compiler have an AbstractFunction1 superclass. The same happens for a number of interfaces in Scala collections.

Single method

If the observed implementations have a single method (besides the ones from Object), it's possible to dispatch directly at the offset and introduce a guard to deoptimize if the itable is larger than expected.

Caching

At JIT compilation time, it's possible to determine the offsets of each observed implementation and create a TypeSwitch that does the dispatch directly at the specific offsets. This is a simple strategy but might be cheaper for some cases, especially since Hotspost doesn't share itable stubs in most architectures and thus doesn't leverage branch prediction well.

Better TypeSwitch

Graal currently compiles TypeSwitch nodes using linear search (sequential switch strategy). I've been exploring an approach to compile TypeSwitches based on the addresses of the classes. That would allow Graal to evaluate other switch strategies and choose the cheapest one. Given that some of the invokeinterface strategies would use TypeSwitch nodes, optimizing it is important to make them effective.

Additionally, I've been working on a new switch compilation strategy based on hashing that would work well for sparse keys (which is the case for the class addresses).

Development phases

If you're ok with the overall approach, I'm planning on submitting multiple pull requests:

  1. Introduce the new hashing strategy to compile switches
  2. Implement a better heuristic to decide between size of the switch tables and cost of the strategies.
  3. Make the switch strategies accept long values (they currently work only for integer) and compile TypeSwitch using class addresses so it can leverage the different switch strategies
  4. Introduce the mechanism to choose between different invokeinterface strategies initially with two simple strategies: Caching and Fallback (that falls back to the itable stubs)
  5. Create individual pull requests for each strategy

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions