
Conversation

@shipilev (Member) commented Sep 23, 2025

Forked from JDK-8366681: there are still some cleanups/performance improvements possible. The current selection code is a bit hairy, and it turns out the changes I made for the previous patch improve performance.

Notable improvements:

  1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows implementing policies for compiling at any A* level based on observing the top-compiled T* levels.
  2. Sort methods by hotness and code size. This appears to have a positive effect on shorter workloads, I suspect because we avoid a lot of C1 compilations by preloading the hottest code first.
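As a rough illustration of point 2, the selection order can be sketched as a two-key sort. The struct and field names below are hypothetical stand-ins for the precompiler's per-method records, not the actual HotSpot types:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a per-method record in the precompiler;
// field names are illustrative only.
struct MethodRecord {
  int hotness;    // e.g. invocation + backedge counts from training data
  int code_size;  // size of the compiled code
  int id;
};

// Sort hottest methods first; break ties by larger code size, so the
// most expensive-to-JIT methods are preloaded earliest.
static void sort_for_preload(std::vector<MethodRecord>& methods) {
  std::sort(methods.begin(), methods.end(),
            [](const MethodRecord& a, const MethodRecord& b) {
              if (a.hotness != b.hotness) return a.hotness > b.hotness;
              return a.code_size > b.code_size;
            });
}
```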

Additional testing:

  • Performance tests (see comments)
  • Linux x86_64 server fastdebug, runtime/cds

Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8368465: [leyden] Improve precompiler method selection code (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/leyden.git pull/99/head:pull/99
$ git checkout pull/99

Update a local copy of the PR:
$ git checkout pull/99
$ git pull https://git.openjdk.org/leyden.git pull/99/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 99

View PR using the GUI difftool:
$ git pr show -t 99

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/leyden/pull/99.diff

Using Webrev

Link to Webrev Comment

@shipilev (Member, Author) commented Sep 23, 2025

javac test (1000 iterations trained, 50 iterations production)

# --- Before

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:AOTCache=app.aot -Xms64m -Xmx1g -XX:+UseSerialGC -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     338.2 ms ±   3.5 ms    [User: 742.4 ms, System: 120.6 ms]
  Range (min … max):   332.3 ms … 342.9 ms    10 runs
 
Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC -cp JavacBenchApp.jar -XX:+UnlockExperimentalVMOptions -XX:+PreloadOnly -XX:AOTCache=app.aot JavacBenchApp 50
  Time (mean ± σ):     497.2 ms ±   4.1 ms    [User: 491.6 ms, System: 55.5 ms]
  Range (min … max):   489.7 ms … 502.3 ms    10 runs
 

# --- After

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:AOTCache=app.aot -Xms64m -Xmx1g -XX:+UseSerialGC -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     322.8 ms ±   2.2 ms    [User: 511.0 ms, System: 101.3 ms]
  Range (min … max):   319.1 ms … 325.9 ms    10 runs
 
Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC -cp JavacBenchApp.jar -XX:+UnlockExperimentalVMOptions -XX:+PreloadOnly -XX:AOTCache=app.aot JavacBenchApp 50
  Time (mean ± σ):     483.0 ms ±   4.5 ms    [User: 476.9 ms, System: 55.4 ms]
  Range (min … max):   476.5 ms … 492.0 ms    10 runs

User time improves significantly; I think that is because we manage to preload the hottest code before C1 discovers it needs to compile it. Note how loading takes half of the workload time, and in "before" C1 was more active:

Before: (image: plot-javac50-before)

After: (image: plot-javac50-after)

bridgekeeper bot commented Sep 23, 2025

👋 Welcome back shade! A progress list of the required criteria for merging this PR into premain will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@shipilev (Member, Author) commented Sep 23, 2025

Larger benchmarks all improve with 1-core tests.

quarkus-getting-started:

Run,Old CDS + AOT,New CDS + AOT
1,235,190
2,234,198
3,224,189
4,227,185
5,224,199
6,226,192
7,234,193
8,219,181
9,218,196
10,214,197
Geomean,225.40,191.92 (1.17x improvement)
Stdev,6.87,5.57

helidon-quickstart-se

Run,Old CDS + AOT,New CDS + AOT
1,196,167
2,191,166
3,195,166
4,196,173
5,199,165
6,203,169
7,199,168
8,200,162
9,198,167
10,199,173
Geomean,197.58,167.57 (1.18x improvement)
Stdev,3.10,3.23

micronaut-first-app

Run,Old CDS + AOT,New CDS + AOT
1,239,239
2,247,246
3,258,237
4,262,229
5,250,234
6,233,233
7,249,236
8,256,239
9,250,233
10,259,241
Geomean,250.15,236.66 (1.06x improvement)

spring-boot-getting-started:

Run,Old CDS + AOT,New CDS + AOT
1,494,450
2,489,440
3,492,441
4,494,426
5,480,440
6,501,441
7,521,441
8,483,443
9,486,435
10,490,441
Geomean,492.88,439.76 (1.12x improvement)
Stdev,10.93,5.78

spring-petclinic:

Run,Old CDS + AOT,New CDS + AOT
1,2691,2407
2,2659,2451
3,2559,2439
4,2645,2433
5,2667,2475
6,2650,2447
7,2647,2454
8,2667,2439
9,2635,2440
10,2649,2446
Geomean,2646.69,2443.05 (1.08x improvement)
Stdev,32.84,16.34

openjdk bot commented Sep 23, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 23, 2025
mlbridge bot commented Sep 23, 2025

Webrevs

@shipilev (Member, Author) commented:

Ready for review, folks. The performance benefits of doing this are quite apparent.

@shipilev shipilev requested review from iwanowww, veresov and vnkozlov and removed request for iwanowww, veresov and vnkozlov October 17, 2025 07:05
@vnkozlov (Collaborator) left a comment

A few questions.

    if (mtd != nullptr) {
      MethodData* md = mtd->final_profile();
      if (md != nullptr) {
        count += md->backedge_count();
vnkozlov (Collaborator):

Hmm, this will put methods with hot loops up front.

shipilev (Member, Author):

Yes, this is intentional: this effectively puts the methods that are profitable to (pre)load first, so they: a) do not linger in the interpreter too long; b) do not trigger JIT compilation before AOT code is able to (pre)load. The methods with hot back-branches are those methods :)
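A minimal sketch of that metric, assuming a profile type with invocation and backedge counters (the struct here is an illustrative stand-in, not HotSpot's MethodData, beyond the backedge_count() usage shown in the excerpt above):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for a method's training-run profile.
struct FakeProfile {
  long invocation_count;
  long backedge_count;
};

// Hotness metric: methods with hot loops accumulate large backedge
// counts, so they sort ahead of straight-line code with similar
// invocation counts.
static long hotness(const FakeProfile* md) {
  long count = 0;
  if (md != nullptr) {
    count += md->invocation_count;
    count += md->backedge_count;  // dominates for loopy methods
  }
  return count;
}
```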


      default: fatal("%d", _search_level);
    }
    // Otherwise, break the tie by code size: largest methods go first.
vnkozlov (Collaborator):

What is the reason for larger methods going first? Can we use the compile ID here instead?

@shipilev (Member, Author) commented Oct 17, 2025:

So my logic was the same as with the hot methods. If we have lost the game of "preload the AOT code before JIT gets triggered", and JIT got triggered, we then want to prioritize larger methods, as they are more likely to take more time to JIT-compile. In other words, I think if you lost to the JIT timing-wise, you want to preempt the fattest JIT compiles first. But it is only a bet. If we ever record compilation time in nmethods/profiles, we could use that to break the tie.

vnkozlov (Collaborator):

I am not sure I understand how a different order of pre/AOT-compilation can affect performance of the production run. We bulk-load all "Preload" AOT code, so ordering should not matter for it, even if we load in the selected order. It is one thread which does the loading, and it is blocking. (I am actually playing with spreading this preload across all compiler threads; I did not see much effect on startup.)

The only explanation is that preload happens only when C2 compiler threads are initialized (Preload AOT code is C2-compiled code), and this happens simultaneously with C1 thread initialization, so C1 threads could become available for C1 compilation before we finish preloading, especially on machines with a small number of cores. I did observe that we start loading A1 and A2 code first (even normal C1 compilations) before we start preloading AP4.

Is that what you are trying to solve here?

The invocation counters should be roughly the same for methods without loops (10000 to trigger C2 compilation). They could be different if code was deoptimized and ran in the interpreter. The only difference is the backedge counter. So in this sense you push methods with hot loops up front, as we discussed in the other comment. This may affect performance, but it would depend on the application.

I agree with ordering by size (or time spent in compilation), but only for methods which do not have A1 or A2 code. Which should not be the case: if we have AP4, we will have A1 and A2 for it. I am still not convinced by this.

Maybe we should try to move AOTCodeCache::preload_code() just after SystemDictionary::compute_java_loaders(), because it does not depend on training data. Then we can have AP4 sooner. Mmapping directly into the CodeCache would also speed up preloading; it is on our to-do list.

@vnkozlov (Collaborator) commented:

I think we have a small performance "issue" in how we replace existing JITed code with new code, to which AOT code loading could be more sensitive. We deoptimize the old code before the new code is set under the NMethodState_lock:
https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/ci/ciEnv.cpp#L1058

If the lock is held by another thread, we may deoptimize the previous code and go into the interpreter before the new code is set for use.

This is present in mainline, but with normal JIT compilation replacement it may not be noticeable.
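The window described above can be sketched as follows. This is a deliberately simplified, single-threaded model of the ordering; the type and function names are stand-ins, not HotSpot's nmethod machinery:

```cpp
#include <cassert>
#include <cstddef>

// Simplified model of code replacement: the old entry is invalidated
// before the new one is published, leaving a window where callers fall
// back to the interpreter. In HotSpot the publication step happens
// under NMethodState_lock; this sketch only models the ordering.
struct Method {
  const void* code;  // current compiled entry point, or nullptr
};

static void replace_code(Method* m, const void* new_code) {
  m->code = nullptr;   // deoptimize: old code is invalidated first
  // <-- window: a call dispatched here runs in the interpreter
  m->code = new_code;  // publish the replacement
}
```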

@vnkozlov (Collaborator) commented:

Another suggestion for this concurrent preloading would be to split the AP4 preload code. One set is the current one, which needs to wait for compute_java_loaders(). A new one (much smaller) would be for simple methods of classes which are loaded first (String, for example), which we can preload much sooner.

bridgekeeper bot commented Nov 15, 2025

@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
