In recent years Java, and the HotSpot JVM in particular, has received several new garbage collectors. Most recently, a generational mode of the Z collector (ZGC) was introduced.

It is common knowledge that an artificial benchmark can be made to show almost any result, but it may still be interesting to create one anyway.

import java.util.ArrayList;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Thread)
@Fork(value = 1, jvmArgs = {"-Xmx100m"})
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
public class GenerationsBenchmark {

    // A plain Object occupies about 16 bytes, so this many of them fill roughly 1 MB.
    public static final int OBJECTS_1MB = 1024 * 1024 / 16;
    private ArrayList<Object> olds;

    @Param({"0", "20", "40", "60"})
    public int oldObjects;

    @Setup
    public void setup() {
        // Objects that stay referenced for the whole run, i.e. the "old" part of the heap.
        olds = new ArrayList<>(oldObjects * OBJECTS_1MB);
        for (int i = 0; i < oldObjects * OBJECTS_1MB; i++) {
            olds.add(new Object());
        }
    }

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseG1GC"})
    public void g1(Blackhole bh) {
        bench(bh);
    }

    private static void bench(Blackhole bh) {
        // Allocate 1 MB of short-lived garbage per invocation.
        for (int i = 0; i < OBJECTS_1MB; i++) {
            bh.consume(new Object());
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + GenerationsBenchmark.class.getSimpleName() + ".*")
                .build();

        new Runner(opt).run();
    }
}

First of all, this benchmark is flawed in that each configuration is run with only one fork. But the results are quite stable, and the differences between configurations are so large that we can save the time. Also, 100 MB is not a common heap size for a modern application, so this benchmark is truly artificial and micro. Some garbage collectors are not suited to such a small heap, while others are optimized for exactly this extreme, so the comparison is not fair at all. Any judgments based on it apply only to similarly unusual situations.

Besides the heap being a small 100 MB, varying amounts of it are occupied by always-referenced objects. Across runs the total amount of such objects varies between 0 and 60 MB (or even slightly more, accounting for the ArrayList referencing them).

Having said all of the above, here are the results.

Benchmark                       (oldObjects)   Mode  Cnt      Score     Error  Units
GenerationsBenchmark.g1                    0  thrpt    5  12434.452 ±  80.775  ops/s
GenerationsBenchmark.g1                   20  thrpt    5  12415.055 ± 167.382  ops/s
GenerationsBenchmark.g1                   40  thrpt    5  11219.861 ± 165.378  ops/s
GenerationsBenchmark.g1                   60  thrpt    5   9816.738 ± 962.618  ops/s

G1 has been the default garbage collector for many years now, so we may treat its results as the reference. But since our test measures throughput, let's look at the throughput-optimized collector, Parallel. Here is the test for it.

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseParallelGC"})
    public void parallelDefault(Blackhole bh) {
        bench(bh);
    }

With the results:

GenerationsBenchmark.parallelDefault       0  thrpt    5  12712.576 ± 144.766  ops/s
GenerationsBenchmark.parallelDefault      20  thrpt    5  10652.029 ± 210.238  ops/s
GenerationsBenchmark.parallelDefault      40  thrpt    5    411.664 ±  30.261  ops/s
GenerationsBenchmark.parallelDefault      60  thrpt    5    180.962 ±  12.367  ops/s

They are a bit disappointing, since Parallel cannot cope once the old region grows large enough… But the total heap is too small for the Parallel collector, so let's try Serial, which should be basically the same but without the overhead of multiple threads.

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseSerialGC"})
    public void serialDefault(Blackhole bh) {
        bench(bh);
    }

Results are:

GenerationsBenchmark.serialDefault         0  thrpt    5  13048.856 ± 131.335  ops/s
GenerationsBenchmark.serialDefault        20  thrpt    5  13023.243 ± 266.766  ops/s
GenerationsBenchmark.serialDefault        40  thrpt    5  13051.316 ±  91.953  ops/s
GenerationsBenchmark.serialDefault        60  thrpt    5    234.666 ±   8.511  ops/s
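One way to check how a collector actually split the heap is to ask the running JVM for its memory pools via the standard `MemoryPoolMXBean` API. Below is a minimal sketch of that idea (the class name `HeapPools` is mine, not from the benchmark); run it with `-Xmx100m` plus `-XX:+UseSerialGC` or `-XX:+UseParallelGC` to compare the generation sizes:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class HeapPools {
    public static void main(String[] args) {
        // Each collector exposes its generations as differently named
        // memory pools (e.g. "Eden Space" under Serial vs "PS Eden Space"
        // under Parallel); the max sizes show how the heap was divided.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            long max = pool.getUsage().getMax();
            System.out.printf("%-30s max=%s%n", pool.getName(),
                    max < 0 ? "undefined" : (max / (1024 * 1024)) + " MB");
        }
    }
}
```

The exact pool names and sizes depend on the collector and JVM version, which is precisely what makes this a quick way to compare defaults.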

It seems that the default young and old region sizes differ between the Serial and Parallel collectors. What about trying to tune them ourselves?

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseSerialGC"}, jvmArgsAppend = {"-XX:NewSize=10m"})
    public void serial10MBNew(Blackhole bh) {
        bench(bh);
    }

And

GenerationsBenchmark.serial10MBNew         0  thrpt    5  12547.985 ±  67.592  ops/s
GenerationsBenchmark.serial10MBNew        20  thrpt    5  12490.781 ± 202.933  ops/s
GenerationsBenchmark.serial10MBNew        40  thrpt    5  12273.471 ± 109.860  ops/s
GenerationsBenchmark.serial10MBNew        60  thrpt    5  12186.136 ± 382.122  ops/s

Throughput with no or a small old region decreased a bit, but we got rid of the tremendous regression in the large-old-region case. I suppose the G1 collector was able to handle all these cases without manual tuning because its heap regions dynamically switch between belonging to the young and the old generation. However, let's try to tune the Parallel collector in a similar way.

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseParallelGC"}, jvmArgsAppend = {"-XX:NewSize=10m"})
    public void parallel10MBNew(Blackhole bh) {
        bench(bh);
    }

And

GenerationsBenchmark.parallel10MBNew       0  thrpt    5  11286.414 ± 741.980  ops/s
GenerationsBenchmark.parallel10MBNew      20  thrpt    5   6177.843 ± 663.973  ops/s
GenerationsBenchmark.parallel10MBNew      40  thrpt    5   4595.050 ± 227.420  ops/s
GenerationsBenchmark.parallel10MBNew      60  thrpt    5   3869.422 ± 138.329  ops/s

The results are much better than without manual tuning, but the degradation is still tremendous. We should not use the Parallel GC with such a small heap.
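To see what a collector is actually doing in runs like these, one can also dump per-collector collection counts via the standard `GarbageCollectorMXBean` API. A small sketch (the class name `GcActivity` is mine; the counts depend heavily on heap size and JVM version):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcActivity {
    public static void main(String[] args) {
        // Produce short-lived garbage so the young collector has work to do.
        byte[] sink = null;
        for (int i = 0; i < 1_000_000; i++) {
            sink = new byte[128];
        }
        System.out.println("last allocation size: " + sink.length);

        // Each collector registers differently named beans, e.g.
        // "G1 Young Generation" / "G1 Old Generation" under G1.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

A skewed ratio of young to old collections is usually the first hint that generation sizing does not match the allocation pattern.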

After these tests, let's look at the modern non-generational collectors.

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseShenandoahGC"})
    public void shenandoah(Blackhole bh) {
        bench(bh);
    }

And

GenerationsBenchmark.shenandoah            0  thrpt    5    288.230 ± 135.077  ops/s
GenerationsBenchmark.shenandoah           20  thrpt    5    531.991 ±  85.333  ops/s
GenerationsBenchmark.shenandoah           40  thrpt    5    394.158 ±  44.923  ops/s
GenerationsBenchmark.shenandoah           60  thrpt    5    197.308 ±  23.372  ops/s

As we can see, Shenandoah is not suited to throughput-oriented workloads with such small heaps.

How about Z garbage collector?

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseZGC"})
    public void zgc(Blackhole bh) {
        bench(bh);
    }

And

GenerationsBenchmark.zgc                   0  thrpt    5   7753.977 ± 275.895  ops/s
GenerationsBenchmark.zgc                  20  thrpt    5   1906.405 ±  56.030  ops/s
GenerationsBenchmark.zgc                  40  thrpt    5    606.713 ±   4.973  ops/s
GenerationsBenchmark.zgc                  60  thrpt    5     27.959 ±   0.774  ops/s

The results are much better, though old objects in the heap still cause degradation. However, I started this post with a link to the generational Z collector introduction, so let's see how it deals with the task.

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseZGC", "-XX:+ZGenerational"})
    public void zgcGen(Blackhole bh) {
        bench(bh);
    }

And

GenerationsBenchmark.zgcGen                0  thrpt    5  10271.648 ± 308.686  ops/s
GenerationsBenchmark.zgcGen               20  thrpt    5   7874.979 ± 622.759  ops/s
GenerationsBenchmark.zgcGen               40  thrpt    5   4300.210 ± 367.747  ops/s
GenerationsBenchmark.zgcGen               60  thrpt    5    481.354 ±  94.315  ops/s

Definitely better, and in the degraded extreme case it beats its direct competitor, Shenandoah.

It seems we can try manual tuning with the generational Z collector as well.

    @Benchmark
    @Fork(jvmArgsPrepend = {"-XX:+UseZGC", "-XX:+ZGenerational"}, jvmArgsAppend = {"-XX:NewSize=10m"})
    public void zgcGen10MBNew(Blackhole bh) {
        bench(bh);
    }

And

GenerationsBenchmark.zgcGen10MBNew         0  thrpt    5  10362.045 ± 484.911  ops/s
GenerationsBenchmark.zgcGen10MBNew        20  thrpt    5   7856.814 ± 488.436  ops/s
GenerationsBenchmark.zgcGen10MBNew        40  thrpt    5   4400.990 ± 292.039  ops/s
GenerationsBenchmark.zgcGen10MBNew        60  thrpt    5    508.789 ±  28.856  ops/s

Absolutely no difference. It seems the NewSize option specifies only the initial size of the young generation, which then changes after some iterations, just like in the G1 collector, though not yet with the same level of success.

To sum up: in this unrealistic artificial case the default collector comes out best, since it needs no tuning options. And in my experience, GC tuning options tend to be stale and no longer reflect the needs of the application, so I prefer not to tune anything and to always question any GC options that are specified, not just a permgen size. (Note: modern HotSpot JVMs have no permgen at all, so its presence in a configuration is a clear indicator that the GC options need revisiting.)
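As a starting point for such a review, the options a JVM was actually started with can be listed from inside the process via the standard `RuntimeMXBean` API. A small sketch (the class name `JvmArgsAudit` is my own):

```java
import java.lang.management.ManagementFactory;

public class JvmArgsAudit {
    public static void main(String[] args) {
        // Print the options this JVM was actually started with, so stale
        // GC settings (such as a leftover permgen size) are easy to spot.
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-XX:") || arg.startsWith("-Xm")) {
                System.out.println(arg);
            }
        }
    }
}
```

Running this inside an application started with, say, `-Xmx100m -XX:+UseSerialGC` would print exactly those flags back, giving a quick inventory of what is worth questioning.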

Though none of the details and numbers here should be relied on, I must say one more thing about my testing environment. I ran all the tests on an M1 Pro MacBook with these versions:

# JMH version: 1.36
# VM version: JDK 21.0.1, OpenJDK 64-Bit Server VM, 21.0.1+12-LTS