Description
I've been working on prototypes to optimize interface dispatches in Graal. I'm planning to start working on pull requests but would like to get your feedback on the overall approach before proceeding.
Problem
Scala codebases usually define computations using composable APIs that build a graph to be later executed by an interpreter. These APIs include abstractions like Scala's and Twitter's Futures for asynchronous computations, Stitch and Clump for batching of remote asynchronous calls, Monix's Task and ZIO's IO for defining pure computations, and many other monad-like APIs.
These abstractions introduce an execution behavior usually not present in traditional Java applications. Given that the definition of the computation graph involves several method invocations, the interpreter typically doesn't get inlined with the definition of the computation, which produces a high number of invokeinterface calls. At interpretation time, there isn't relevant profiling information to determine the concrete implementations of the transformations, so they don't get optimized.
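To make the pattern concrete, here's a minimal Scala sketch of the kind of code I'm describing (fetchUser and fetchScore are made-up placeholders): the graph is assembled out of small Function1 closures, and the Future machinery later invokes each of them through the Function1 interface.

```scala
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

object CompositionSketch {
  // Hypothetical remote calls, just to give the graph some shape.
  def fetchUser(id: Long): Future[String] = Future(s"user-$id")
  def fetchScore(user: String): Future[Int] = Future(user.length)

  // Each map/flatMap/recover registers a separate Function1 (or PartialFunction)
  // that the Future implementation later calls through its interface type, so the
  // call sites inside the Future machinery become megamorphic interface calls.
  def score(id: Long): Future[Int] =
    fetchUser(id)
      .flatMap(fetchScore)
      .map(_ * 2)
      .recover { case _: Exception => 0 }
}
```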
For instance, this is a screenshot of a VTune profiling session of an application that uses Stitch:
The itable stub is the mechanism that HotSpot uses to dispatch invokeinterface calls and, as we can observe, most of the CPU usage goes into executing these stubs.
Better inlining might be able to reduce the number of invokeinterface calls but, given the nature of these codebases, they'll always produce a high number of interface dispatches.
Solution
I've been exploring alternatives to optimize interface dispatches in Graal. The idea is to have a set of strategies that leverage profiling information and common patterns to inline invokeinterface stubs. Graal would evaluate these strategies and choose the cheapest one.
Strategies
I've identified a few strategies so far:
Common offset
If the observed implementations of an interface resolve to the same method offset, it's possible to inline the stub at the common offset and introduce a guard to deoptimize if a new implementation is observed.
Common superclass
If the observed implementations extend a common superclass that contains the method, it's possible to dispatch through the superclass method and introduce a guard to deoptimize if the object doesn't extend the superclass. For instance, all Function1 instances generated by the Scala compiler have an AbstractFunction1 superclass. The same happens for a number of interfaces in Scala collections.
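As a rough source-level illustration of the guard (the real transformation would be applied by the compiler to the optimized code; the helper below is only a sketch): when the receiver extends the common superclass, the call goes through the class's vtable instead of the itable.

```scala
import scala.runtime.AbstractFunction1

object SuperclassDispatchSketch {
  // If the receiver extends the common superclass, call apply through it
  // (a virtual call via the vtable); otherwise fall back to the plain interface
  // call, which is where the compiled code would deoptimize instead.
  def applyGuarded(f: Int => Int, x: Int): Int = f match {
    case af: AbstractFunction1[Int, Int] @unchecked => af(x)    // virtual dispatch
    case other                                      => other(x) // invokeinterface fallback
  }
}
```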
Single method
If the observed implementations have a single method (besides the ones from Object), it's possible to dispatch directly at the offset and introduce a guard to deoptimize if the itable is larger than expected.
Caching
At JIT compilation time, it's possible to determine the offsets of each observed implementation and create a TypeSwitch that does the dispatch directly at the specific offsets. This is a simple strategy but might be cheaper in some cases, especially since HotSpot doesn't share itable stubs on most architectures and thus doesn't leverage branch prediction well.
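At the source level, this is essentially a class-based switch over the implementations seen in the profile; the sketch below uses hypothetical UnaryOp, Succ and Pred types to show the shape of the guarded dispatch.

```scala
object CachingDispatchSketch {
  trait UnaryOp { def apply(x: Int): Int }
  final class Succ extends UnaryOp { def apply(x: Int): Int = x + 1 }
  final class Pred extends UnaryOp { def apply(x: Int): Int = x - 1 }

  // Suppose the profile only observed Succ and Pred at this call site. The compiled
  // code checks the receiver's class and jumps straight to the concrete method,
  // falling back (deoptimizing, in the real implementation) for anything else.
  def applyCached(op: UnaryOp, x: Int): Int = op match {
    case s: Succ => s.apply(x)     // direct, inlineable call guarded by the class check
    case p: Pred => p.apply(x)     // direct call
    case other   => other.apply(x) // unexpected implementation: regular itable dispatch
  }
}
```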
Better TypeSwitch
Graal currently compiles TypeSwitch nodes using linear search (the sequential switch strategy). I've been exploring an approach to compile TypeSwitches based on the addresses of the classes, which would allow Graal to evaluate other switch strategies and choose the cheapest one. Given that some of the invokeinterface strategies would use TypeSwitch nodes, optimizing TypeSwitch is important to make those strategies effective.
Additionally, I've been working on a new switch compilation strategy based on hashing that would work well for sparse keys (which is the case for the class addresses).
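To illustrate the hashing idea (this is only a sketch, not the actual Graal implementation): search for a multiplier that maps the sparse keys collision-free into a small power-of-two table, then dispatch with a multiply, a shift, a load, and a single comparison against the expected key. The key values below are made up.

```scala
object HashSwitchSketch {
  /** Search for a multiplier that maps `keys` into 2^bits slots without collisions. */
  def findMultiplier(keys: Array[Long], bits: Int): Option[Long] = {
    val rnd = new scala.util.Random(42)
    def slot(key: Long, m: Long): Int = ((key * m) >>> (64 - bits)).toInt
    Iterator.continually(rnd.nextLong() | 1L) // odd multipliers
      .take(10000)
      .find(m => keys.map(slot(_, m)).distinct.length == keys.length)
  }

  def main(args: Array[String]): Unit = {
    // Sparse, address-like keys (made-up values standing in for class addresses).
    val keys = Array(0x7f3a00001000L, 0x7f3a00013800L, 0x7f3a000a4c00L)
    val bits = 3 // 8-slot table
    val m = findMultiplier(keys, bits).getOrElse(sys.error("no multiplier found"))
    def slot(key: Long): Int = ((key * m) >>> (64 - bits)).toInt

    // Each slot stores the key it expects plus the index of the switch target.
    val slotKey = Array.fill(1 << bits)(-1L)
    val slotTarget = Array.fill(1 << bits)(-1)
    for ((k, i) <- keys.zipWithIndex) { slotKey(slot(k)) = k; slotTarget(slot(k)) = i }

    // Dispatch: one multiply, one shift, one load, one compare.
    def dispatch(key: Long): Int = {
      val s = slot(key)
      if (slotKey(s) == key) slotTarget(s) else -1 // -1 = default branch (deoptimize)
    }
    keys.foreach(k => println(f"key 0x$k%x -> target ${dispatch(k)}"))
  }
}
```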
Development phases
If you're ok with the overall approach, I'm planning on submitting multiple pull requests:
- Introduce the new hashing strategy to compile switches
- Implement a better heuristic to decide between the size of the switch tables and the cost of the strategies
- Make the switch strategies accept long values (they currently work only for integers) and compile TypeSwitch using class addresses so it can leverage the different switch strategies
- Introduce the mechanism to choose between different invokeinterface strategies, initially with two simple strategies: Caching and Fallback (which falls back to the itable stubs)
- Create individual pull requests for each strategy