Refactor engine JIT stage #2806

Open. timcassell wants to merge 10 commits into master from jit-stage.

@timcassell
Collaborator

timcassell commented Jul 9, 2025

Fixes #2004
Fixes #1466
Contributes to #2787, #1993, #1780, #1210

  • Moved JIT stage from EngineFactory to a proper EngineJitStage.
    • JIT stage now attempts to push the benchmarked method through all JIT tiers.
  • Moved heuristic from EngineFactory to a new pilot stage (JIT stage, according to its name, now only focuses on jitting).
    • Fixed the heuristic to never include the first invocation.
  • Cleanup around IEngine (breaking changes).
  • Improved check for LegacyJit.
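
The PR description does not show the heuristic itself. As a minimal sketch of the idea of never including the first invocation in a timing estimate, assuming per-invocation timings are already collected, one could discard the first sample (which includes JIT time) before averaging. `PilotHeuristicSketch` and `EstimateSteadyNanoseconds` are hypothetical names, not BenchmarkDotNet APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PilotHeuristicSketch
{
    // Hypothetical helper: averages per-invocation times while ignoring the
    // first sample, which is inflated by JIT compilation of the workload.
    public static double EstimateSteadyNanoseconds(IReadOnlyList<double> invocationNanoseconds)
    {
        if (invocationNanoseconds.Count < 2)
            throw new ArgumentException("Need at least two invocations so the first can be excluded.");

        return invocationNanoseconds.Skip(1).Average();
    }
}
```

With samples of 500, 10 and 12 ns, the estimate uses only the last two values, so one slow first call does not skew whatever decision is based on it.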

yield return GetOverheadNoUnrollIterationData();
yield return GetDummyIterationData(dummy2Action);
yield return GetWorkloadNoUnrollIterationData();
yield return GetDummyIterationData(dummy3Action);
Collaborator Author

@AndreyAkinshin You added dummy actions in 2017. I don't know what they are for. Do we still need them?

@timcassell
Collaborator Author

cc @AndyAyersMS @EgorBo

@EgorBo
Member

EgorBo commented Jul 13, 2025

> JIT stage now attempts to push the benchmarked method through all JIT tiers.
> Set environment variable for the runtime to enable aggressive tiering by default.

Honestly, I think you shouldn't use TC_AggressiveTiering. Promoting to Tier1 after just 1 iteration is mostly meant for internal testing. I think CallCountingDelayMs=0 should be enough.

@timcassell
Collaborator Author

> Honestly, I think you shouldn't use TC_AggressiveTiering. Promoting to Tier1 after just 1 iteration is mostly meant for internal testing. I think CallCountingDelayMs=0 should be enough.

Can you elaborate on that? Why would we need more than 1 invocation per tier for throughput benchmarks? 30 invocations is too much for the stage to complete in a timely manner for long-running benchmarks.

Also, I tried CallCountingDelayMs=0, but it breaks the disassembler (dotnet/runtime#117339).

@EgorBo
Member

EgorBo commented Jul 13, 2025

> Can you elaborate on that?

I think the profile will not be representative (a benchmark may invoke the same method from different places, and we don't have context-sensitive profiling yet). We also have optimizations where we intentionally make call counting smaller for some methods so that their callers are guaranteed to be promoted later (this is for some internal calls, so we can bake the final addresses of their Tier1 code versions in directly instead of using indirect calls). That said, I am mostly concerned about PGO quality.

@timcassell
Collaborator Author

Thanks, that makes sense. I guess I can remove that env var and just run the jit stage with a timeout, and if it doesn't fully reach tier1, we can allow the pilot/warmup stages to handle it later (#1210).

Can you also verify the logic in JitInfo.cs?

@EgorBo
Member

EgorBo commented Jul 13, 2025

> Thanks, that makes sense. I guess I can remove that env var and just run the jit stage with a timeout, and if it doesn't fully reach tier1

How do you check that? I don't think there is a way to check whether a benchmark and all of its callees are fully warmed up.

@timcassell
Collaborator Author

> > Thanks, that makes sense. I guess I can remove that env var and just run the jit stage with a timeout, and if it doesn't fully reach tier1
>
> How do you check that? I don't think there is a way to check whether a benchmark and all of its callees are fully warmed up.

We don't. We just run a number of invocations based on the configured values retrieved from JitInfo and hope for the best. The pilot/warmup stages will have to work with some sort of heuristic to try to determine if tiering caused the measured time to significantly drop.
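
As a rough illustration of the approach described here (a fixed number of invocations, cut short by a timeout), the following sketch assumes a per-tier invocation count and a count of remaining tiers are already known. `JitStageSketch`, `Run` and their parameters are hypothetical names; this is not the code in the PR or in JitInfo.cs.

```csharp
using System;
using System.Diagnostics;

public static class JitStageSketch
{
    // Hypothetical helper: invokes the workload up to
    // invocationsPerTier * remainingTiers times, but stops early once the
    // time budget is used up. Returns the number of invocations performed.
    public static int Run(Action workload, int invocationsPerTier, int remainingTiers, TimeSpan timeout)
    {
        int target = invocationsPerTier * remainingTiers;
        Stopwatch clock = Stopwatch.StartNew();
        int done = 0;

        while (done < target)
        {
            workload();
            done++;

            // The budget is checked after each call, so a single very long
            // invocation still completes before the stage gives up.
            if (clock.Elapsed >= timeout)
                break;
        }

        return done;
    }
}
```

Calling `JitStageSketch.Run(benchmark, 30, 2, TimeSpan.FromSeconds(10))` would attempt 60 invocations; the 30 mirrors the figure mentioned in the thread. If the budget runs out first, the later pilot/warmup stages would have to cope with code that may not have reached tier1.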

@timcassell timcassell force-pushed the jit-stage branch 2 times, most recently from 8a143d6 to f02de9b Compare July 13, 2025 21:42
@timcassell
Collaborator Author

dotnet/runtime#117787 (comment)

> The "third tier" you see may be OSR, since your method loops a lot and isn't called often.

@AndyAyersMS (to not derail that issue), how can we account for OSR in the jit stage here?

@EgorBo
Member

EgorBo commented Jul 17, 2025

> dotnet/runtime#117787 (comment)
>
> > The "third tier" you see may be OSR, since your method loops a lot and isn't called often.
>
> @AndyAyersMS (to not derail that issue), how can we account for OSR in the jit stage here?

I think for BDN specifically, OSR is just an intermediate tier that it doesn't have to care about; it shouldn't impact the Tier0->Tier1 promotion velocity. Since the method is too slow, I guess BDN decided not to call it too many times?

@timcassell
Collaborator Author

> Since the method is too slow, I guess BDN decided not to call it too many times?

This is purely for the jit stage, where the number of invocations is fixed (in an attempt to push the method through all tiers). I'm not sure what the JIT considers "not called often enough". Perhaps it is because of how the stages work: each iteration invokes the method only once, and the JIT can't see that the iterations are being run multiple times? If we called it through the WorkloadUnroll method (with unrollFactor = 16), would the JIT skip OSR?
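
For readers unfamiliar with the distinction, here is a simplified sketch of the difference between a no-unroll loop and an unrolled loop with an unroll factor of 16. `UnrollSketch` and its method names are hypothetical, and the real generated benchmark code differs; the sketch only shows the call pattern and does not answer whether OSR would be skipped.

```csharp
using System;

public static class UnrollSketch
{
    // One call to the workload per loop iteration.
    public static void WorkloadNoUnroll(Action workload, long loopCount)
    {
        for (long i = 0; i < loopCount; i++)
        {
            workload();
        }
    }

    // Sixteen calls to the workload per loop iteration (unroll factor 16),
    // written out by hand so loop overhead is amortized across the calls.
    public static void WorkloadUnroll16(Action workload, long loopCount)
    {
        for (long i = 0; i < loopCount; i++)
        {
            workload(); workload(); workload(); workload();
            workload(); workload(); workload(); workload();
            workload(); workload(); workload(); workload();
            workload(); workload(); workload(); workload();
        }
    }
}
```

In this sketch a single pass through the unrolled loop body calls the benchmarked method 16 times, whereas the no-unroll variant calls it once.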

@timcassell
Collaborator Author

timcassell commented Jul 17, 2025

> I think for BDN specifically, OSR is just an intermediate tier that it doesn't have to care about; it shouldn't impact the Tier0->Tier1 promotion velocity.

That's what I thought, but the evidence shows otherwise. It took 60 invocations to fully reach tier1, instead of 30 (DPGO disabled).

@AndyAyersMS
Member

Did you try profiling the example from dotnet/runtime#117787? If not, I can do it soonish.

@timcassell
Collaborator Author

> Did you try profiling the example from dotnet/runtime#117787? If not, I can do it soonish.

Nope, I don't have enough experience to know what to look for. If you're going to do it from this branch, add +2 to remainingTiers in the jit stage to see the results of all tiers. Appreciate it.

Don't jit overhead methods if the job is configured to not measure it.
Remove extra call counting delay for in-process benchmarks.
Set CallCountingDelayMs env var if DisassemblyDiagnoser is not used.
Added a test for very long first invocation time.
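
The commit note about setting the CallCountingDelayMs variable only when DisassemblyDiagnoser is not used could look roughly like the sketch below. `TieringEnvironmentSketch` and `Build` are hypothetical names, and the full variable name `DOTNET_TC_CallCountingDelayMs` is an assumption based on the runtime's usual `DOTNET_` prefix; the thread only mentions `CallCountingDelayMs=0`.

```csharp
using System.Collections.Generic;

public static class TieringEnvironmentSketch
{
    // Hypothetical helper: builds the extra environment variables for the
    // benchmark process. The variable name is an assumption (see lead-in).
    public static Dictionary<string, string> Build(bool usesDisassembler)
    {
        var env = new Dictionary<string, string>();

        // The thread reports that a zero call counting delay breaks the
        // disassembler (dotnet/runtime#117339), so it is skipped in that case.
        if (!usesDisassembler)
        {
            env["DOTNET_TC_CallCountingDelayMs"] = "0";
        }

        return env;
    }
}
```

The returned dictionary would then be applied to the child process that runs the benchmark; how that is wired up is left out here.
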
Successfully merging this pull request may close these issues.

  • Wildly different results from simply swapping the order of benchmarks
  • Unoptimized code is used for benchmark