So, I got bored with waiting for tests of my laptop and built myself a new desktop for the first time in 10 years. I’ve built it around 14900K CPU and no separate GPU at all. For the OS I use Ubuntu 24.04. And I wanted to look closer to the differences between cores in this CPU. It has 8 performance cores and 16 efficient cores. So I wrote this benchmark to test them.

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@Fork(3)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
public class ThreadsBenchmark {

    @Param("default")
    public String threads;

    @Benchmark
    public byte[] bench() {
        var random = RandomGenerator.getDefault();
        var bytes = new byte[1024];
        for (int i = 0; i < 1024; i++) {
            random.nextBytes(bytes);
        }
        return bytes;
    }

    public static void main(String[] args) throws RunnerException {
        var results = new ArrayList<RunResult>();
        for (int i = 1; i <= 36; i++) {
            Options opt = new OptionsBuilder()
                    .include(".*" + ThreadsBenchmark.class.getSimpleName() + ".*")
                    .verbosity(VerboseMode.SILENT)
                    .param("threads", String.valueOf(i))
                    .threads(i)
                    .build();
            results.addAll(new Runner(opt).run());
            System.out.println("Finished " + i);
        }

        OutputFormatFactory.createFormatInstance(System.out, Defaults.VERBOSITY).endRun(results);
    }
}

I’ve run it with

$ java -version
openjdk version "21.0.5" 2024-10-15
OpenJDK Runtime Environment (build 21.0.5+11-Ubuntu-1ubuntu124.04)
OpenJDK 64-Bit Server VM (build 21.0.5+11-Ubuntu-1ubuntu124.04, mixed mode, sharing)

And got the following results:

Benchmark               (threads)  Mode  Cnt     Score    Error  Units
ThreadsBenchmark.bench          1  avgt   15   456.827 ±  1.539  us/op
ThreadsBenchmark.bench          2  avgt   15   478.378 ±  0.669  us/op
ThreadsBenchmark.bench          3  avgt   15   478.158 ±  2.052  us/op
ThreadsBenchmark.bench          4  avgt   15   478.451 ±  2.894  us/op
ThreadsBenchmark.bench          5  avgt   15   479.349 ±  2.073  us/op
ThreadsBenchmark.bench          6  avgt   15   477.210 ±  2.160  us/op
ThreadsBenchmark.bench          7  avgt   15   476.744 ±  1.122  us/op
ThreadsBenchmark.bench          8  avgt   15   467.190 ± 21.820  us/op
ThreadsBenchmark.bench          9  avgt   15   510.866 ± 26.721  us/op
ThreadsBenchmark.bench         10  avgt   15   527.802 ± 31.445  us/op
ThreadsBenchmark.bench         11  avgt   15   558.562 ± 33.271  us/op
ThreadsBenchmark.bench         12  avgt   15   583.522 ± 38.983  us/op
ThreadsBenchmark.bench         13  avgt   15   586.813 ± 38.401  us/op
ThreadsBenchmark.bench         14  avgt   15   601.645 ± 40.331  us/op
ThreadsBenchmark.bench         15  avgt   15   616.489 ± 37.435  us/op
ThreadsBenchmark.bench         16  avgt   15   626.235 ± 39.271  us/op
ThreadsBenchmark.bench         17  avgt   15   640.925 ± 45.910  us/op
ThreadsBenchmark.bench         18  avgt   15   653.792 ± 45.098  us/op
ThreadsBenchmark.bench         19  avgt   15   649.266 ± 36.483  us/op
ThreadsBenchmark.bench         20  avgt   15   661.931 ± 40.578  us/op
ThreadsBenchmark.bench         21  avgt   15   675.848 ± 38.651  us/op
ThreadsBenchmark.bench         22  avgt   15   689.341 ± 35.174  us/op
ThreadsBenchmark.bench         23  avgt   15   705.485 ± 31.321  us/op
ThreadsBenchmark.bench         24  avgt   15   720.715 ± 30.871  us/op
ThreadsBenchmark.bench         25  avgt   15   745.423 ± 35.487  us/op
ThreadsBenchmark.bench         26  avgt   15   772.899 ± 30.168  us/op
ThreadsBenchmark.bench         27  avgt   15   796.469 ± 34.395  us/op
ThreadsBenchmark.bench         28  avgt   15   822.673 ± 33.470  us/op
ThreadsBenchmark.bench         29  avgt   15   839.802 ± 34.424  us/op
ThreadsBenchmark.bench         30  avgt   15   863.312 ± 34.641  us/op
ThreadsBenchmark.bench         31  avgt   15   876.696 ± 38.417  us/op
ThreadsBenchmark.bench         32  avgt   15   897.369 ± 35.127  us/op
ThreadsBenchmark.bench         33  avgt   15   925.832 ± 39.338  us/op
ThreadsBenchmark.bench         34  avgt   15   961.630 ± 37.463  us/op
ThreadsBenchmark.bench         35  avgt   15   989.904 ± 39.702  us/op
ThreadsBenchmark.bench         36  avgt   15  1031.029 ± 44.007  us/op

I expected to get some serious shifts between 8 and 9 threads since I have 8 performance cores. And basically get that.

Next shift I expected between 16 and 17 threads, since I have 8 performance cores with hyper-threading. However, got nothing like that. Performance on each thread slowly dropped with each new thread added to the bench.

This performance drop on each thread continued after number of threads exceeded the number of CPU threads.

From what I can see it seems that there is no need to profit to limit concurrency on this CPU to the number of performance threads at least in some cases demonstrated here.

Update: this seems too close to the original post and too small for the separate one. Link to heading

Since the total amount of work is different previous benchmark results are actually hard to compare. So I’ve written another one:

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@Fork(3)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
public class ThreadsBenchmark {
    @Param({"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
            "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32",
            "33", "34", "35", "36",
    })
    public String threads;

    private static final int RESULTS = 128;

    @Benchmark
    public List<byte[]> bench() throws ExecutionException, InterruptedException {
        try (var executor = Executors.newFixedThreadPool(Integer.parseInt(threads))) {
            var futures = new ArrayList<Future<byte[]>>(RESULTS);
            for (int i = 0; i < RESULTS; i++) {
                futures.add(executor.submit(() -> {
                    var random = RandomGenerator.getDefault();
                    var bytes = new byte[1024];
                    for (int j = 0; j < 1024; j++) {
                        random.nextBytes(bytes);
                    }
                    return bytes;
                }));
            }
            var result = new ArrayList<byte[]>(RESULTS);
            for (int i = 0; i < RESULTS; i++) {
                result.add(futures.get(i).get());
            }
            return result;
        }
    }
}

This produces the following result:

Benchmark                (threads)  Mode  Cnt   Score   Error  Units
ThreadsBenchmark.bench           1  avgt   15  57.861 ± 1.329  ms/op
ThreadsBenchmark.bench           2  avgt   15  31.279 ± 0.527  ms/op
ThreadsBenchmark.bench           3  avgt   15  26.376 ± 0.926  ms/op
ThreadsBenchmark.bench           4  avgt   15  20.307 ± 0.404  ms/op
ThreadsBenchmark.bench           5  avgt   15  16.930 ± 0.237  ms/op
ThreadsBenchmark.bench           6  avgt   15  15.026 ± 0.358  ms/op
ThreadsBenchmark.bench           7  avgt   15  12.878 ± 0.171  ms/op
ThreadsBenchmark.bench           8  avgt   15  11.145 ± 0.204  ms/op
ThreadsBenchmark.bench           9  avgt   15  10.054 ± 0.181  ms/op
ThreadsBenchmark.bench          10  avgt   15   8.818 ± 0.073  ms/op
ThreadsBenchmark.bench          11  avgt   15   7.942 ± 0.090  ms/op
ThreadsBenchmark.bench          12  avgt   15   7.389 ± 0.173  ms/op
ThreadsBenchmark.bench          13  avgt   15   7.010 ± 0.186  ms/op
ThreadsBenchmark.bench          14  avgt   15   6.285 ± 0.173  ms/op
ThreadsBenchmark.bench          15  avgt   15   5.920 ± 0.071  ms/op
ThreadsBenchmark.bench          16  avgt   15   5.459 ± 0.049  ms/op
ThreadsBenchmark.bench          17  avgt   15   5.347 ± 0.049  ms/op
ThreadsBenchmark.bench          18  avgt   15   5.262 ± 0.043  ms/op
ThreadsBenchmark.bench          19  avgt   15   5.225 ± 0.208  ms/op
ThreadsBenchmark.bench          20  avgt   15   4.773 ± 0.043  ms/op
ThreadsBenchmark.bench          21  avgt   15   4.744 ± 0.022  ms/op
ThreadsBenchmark.bench          22  avgt   15   4.635 ± 0.080  ms/op
ThreadsBenchmark.bench          23  avgt   15   4.643 ± 0.029  ms/op
ThreadsBenchmark.bench          24  avgt   15   4.610 ± 0.026  ms/op
ThreadsBenchmark.bench          25  avgt   15   4.641 ± 0.063  ms/op
ThreadsBenchmark.bench          26  avgt   15   4.597 ± 0.008  ms/op
ThreadsBenchmark.bench          27  avgt   15   4.595 ± 0.019  ms/op
ThreadsBenchmark.bench          28  avgt   15   4.622 ± 0.038  ms/op
ThreadsBenchmark.bench          29  avgt   15   4.524 ± 0.009  ms/op
ThreadsBenchmark.bench          30  avgt   15   4.518 ± 0.048  ms/op
ThreadsBenchmark.bench          31  avgt   15   4.472 ± 0.061  ms/op
ThreadsBenchmark.bench          32  avgt   15   4.455 ± 0.009  ms/op
ThreadsBenchmark.bench          33  avgt   15   4.432 ± 0.096  ms/op
ThreadsBenchmark.bench          34  avgt   15   4.495 ± 0.009  ms/op
ThreadsBenchmark.bench          35  avgt   15   4.516 ± 0.018  ms/op
ThreadsBenchmark.bench          36  avgt   15   4.528 ± 0.023  ms/op

A can draw several conclusions from these results. Really using more than 2x performance cores produce little value. However, nevertheless there is some value in using more cores including efficient ones. Also, there is no penalty at all. So usual technique of using a thread pool of size equal or greater to quantity of logical cores on the CPU should work just fine.