In the previous post I benchmarked heavily CPU-bound code. It may not seem so, but that code actually appears to be memory bound. I have read somewhere that the efficiency cores of the 14900 scale worse on memory accesses than on pure number crunching. So I wrote the following:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@Fork(3)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
public class ThreadsBenchmark {
    // Not used by the benchmark body; it only labels each run with its thread count.
    @Param("default")
    public String threads;

    @Benchmark
    public BigInteger bench() {
        var result = BigInteger.ONE;
        for (int i = 1; i <= 2048; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) throws RunnerException {
        var results = new ArrayList<RunResult>();
        // Run the same benchmark once per thread count and collect all results.
        for (int i = 1; i <= 36; i++) {
            Options opt = new OptionsBuilder()
                    .include(".*" + ThreadsBenchmark.class.getSimpleName() + ".*")
                    .addProfiler(GCProfiler.class)
                    .verbosity(VerboseMode.SILENT)
                    .param("threads", String.valueOf(i))
                    .threads(i)
                    .build();
            results.addAll(new Runner(opt).run());
            System.out.println("Finished " + i);
        }
        // Print a single combined results table at the end.
        OutputFormatFactory.createFormatInstance(System.out, Defaults.VERBOSITY).endRun(results);
    }
}
And got the following results:
Benchmark (threads) Mode Cnt Score Error Units
ThreadsBenchmark.bench 1 avgt 15 381.051 ± 6.303 us/op
ThreadsBenchmark.bench 2 avgt 15 404.597 ± 4.457 us/op
ThreadsBenchmark.bench 3 avgt 15 429.157 ± 3.651 us/op
ThreadsBenchmark.bench 4 avgt 15 470.670 ± 2.058 us/op
ThreadsBenchmark.bench 5 avgt 15 526.845 ± 3.049 us/op
ThreadsBenchmark.bench 6 avgt 15 595.859 ± 4.137 us/op
ThreadsBenchmark.bench 7 avgt 15 653.487 ± 3.804 us/op
ThreadsBenchmark.bench 8 avgt 15 708.640 ± 6.378 us/op
ThreadsBenchmark.bench 9 avgt 15 794.764 ± 7.892 us/op
ThreadsBenchmark.bench 10 avgt 15 879.006 ± 7.382 us/op
ThreadsBenchmark.bench 11 avgt 15 971.415 ± 8.570 us/op
ThreadsBenchmark.bench 12 avgt 15 1101.470 ± 56.425 us/op
ThreadsBenchmark.bench 13 avgt 15 1115.850 ± 5.051 us/op
ThreadsBenchmark.bench 14 avgt 15 1202.304 ± 5.871 us/op
ThreadsBenchmark.bench 15 avgt 15 1299.112 ± 10.511 us/op
ThreadsBenchmark.bench 16 avgt 15 1395.623 ± 10.643 us/op
ThreadsBenchmark.bench 17 avgt 15 1487.989 ± 10.312 us/op
ThreadsBenchmark.bench 18 avgt 15 1589.793 ± 10.064 us/op
ThreadsBenchmark.bench 19 avgt 15 1684.144 ± 11.208 us/op
ThreadsBenchmark.bench 20 avgt 15 1774.493 ± 18.495 us/op
ThreadsBenchmark.bench 21 avgt 15 1875.067 ± 16.203 us/op
ThreadsBenchmark.bench 22 avgt 15 1973.474 ± 11.102 us/op
ThreadsBenchmark.bench 23 avgt 15 2072.661 ± 16.053 us/op
ThreadsBenchmark.bench 24 avgt 15 2171.723 ± 14.909 us/op
ThreadsBenchmark.bench 25 avgt 15 2249.195 ± 16.467 us/op
ThreadsBenchmark.bench 26 avgt 15 2327.629 ± 11.873 us/op
ThreadsBenchmark.bench 27 avgt 15 2417.717 ± 11.460 us/op
ThreadsBenchmark.bench 28 avgt 15 2507.770 ± 7.841 us/op
ThreadsBenchmark.bench 29 avgt 15 2593.257 ± 15.216 us/op
ThreadsBenchmark.bench 30 avgt 15 2668.016 ± 10.484 us/op
ThreadsBenchmark.bench 31 avgt 15 2755.329 ± 16.218 us/op
ThreadsBenchmark.bench 32 avgt 15 2824.268 ± 6.769 us/op
ThreadsBenchmark.bench 33 avgt 15 2923.594 ± 13.014 us/op
ThreadsBenchmark.bench 34 avgt 15 3021.506 ± 22.468 us/op
ThreadsBenchmark.bench 35 avgt 15 3106.939 ± 11.508 us/op
ThreadsBenchmark.bench 36 avgt 15 3193.335 ± 14.416 us/op
I’ve omitted the GCProfiler output. The normalized allocation rate (gc.alloc.rate.norm) is constant across the benchmarks, as it should be, at 4173400 B/op. The total allocation rate is more interesting:
Benchmark (threads) Mode Cnt Score Error Units
ThreadsBenchmark.bench:·gc.alloc.rate 1 avgt 15 10445.158 ± 171.097 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 2 avgt 15 19668.996 ± 215.412 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 3 avgt 15 27810.120 ± 234.417 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 4 avgt 15 33800.293 ± 149.677 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 5 avgt 15 37741.635 ± 222.418 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 6 avgt 15 40045.123 ± 274.842 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 7 avgt 15 42593.372 ± 239.159 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 8 avgt 15 44896.284 ± 388.314 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 9 avgt 15 45246.806 ± 425.496 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 10 avgt 15 45570.187 ± 450.535 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 11 avgt 15 45503.650 ± 375.426 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 12 avgt 15 43901.752 ± 2187.294 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 13 avgt 15 46962.621 ± 179.555 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 14 avgt 15 46922.119 ± 183.105 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 15 avgt 15 46607.365 ± 352.800 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 16 avgt 15 46193.735 ± 291.412 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 17 avgt 15 46032.232 ± 246.677 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 18 avgt 15 45633.265 ± 184.391 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 19 avgt 15 45489.668 ± 155.867 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 20 avgt 15 45376.666 ± 297.628 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 21 avgt 15 45100.441 ± 218.035 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 22 avgt 15 45053.339 ± 279.386 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 23 avgt 15 44801.914 ± 216.204 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 24 avgt 15 44563.037 ± 249.170 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 25 avgt 15 44691.730 ± 193.041 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 26 avgt 15 44813.572 ± 155.758 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 27 avgt 15 44807.594 ± 178.611 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 28 avgt 15 44730.577 ± 136.466 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 29 avgt 15 44797.707 ± 263.377 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 30 avgt 15 45044.544 ± 165.125 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 31 avgt 15 45031.436 ± 193.781 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 32 avgt 15 45254.605 ± 107.844 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 33 avgt 15 45205.982 ± 115.027 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 34 avgt 15 45121.584 ± 304.400 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 35 avgt 15 45162.079 ± 119.766 MB/sec
ThreadsBenchmark.bench:·gc.alloc.rate 36 avgt 15 45257.289 ± 166.231 MB/sec
It rises as more threads are added, up to approximately 8 threads (the CPU has 8 physical performance cores), and stays around 45 thousand MB/sec after that.
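As a sanity check (my own back-of-the-envelope calculation, not something from the benchmark output), the total allocation rate should be roughly the thread count times the normalized allocation per op divided by the time per op. Plugging in the 8-thread numbers from the tables above, and assuming the MB in the gc.alloc.rate output means 1024 * 1024 bytes:

// Consistency check using numbers copied from the two tables above.
// Assumption: "MB" in the gc.alloc.rate output means 1024 * 1024 bytes.
public class AllocRateCheck {
    public static void main(String[] args) {
        int threads = 8;
        double bytesPerOp = 4_173_400;   // gc.alloc.rate.norm, B/op
        double usPerOp = 708.640;        // average time at 8 threads, us/op
        double bytesPerSecond = threads * bytesPerOp / (usPerOp / 1_000_000);
        System.out.printf("%.0f MB/sec%n", bytesPerSecond / (1024 * 1024)); // ~44930
    }
}

That lands close to the measured 44896 ± 388 MB/sec, so the timing table and the allocation-rate table are consistent with each other.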
As in the previous post, these results may be valuable for investigating potential degradation when serving many concurrent requests on a server. But since the total amount of work differs between the benchmarks, they are hard to compare, so I also wrote the following code:
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@Fork(3)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
public class Threads2Benchmark {
    @Param({"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
            "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32",
            "33", "34", "35", "36",
    })
    public String threads;

    private static final int FACTORIALS = 128;

    @Benchmark
    public List<BigInteger> bench() throws ExecutionException, InterruptedException {
        try (var executor = Executors.newFixedThreadPool(Integer.parseInt(threads))) {
            // The total amount of work (128 factorials) is fixed; only the pool size varies.
            var futures = new ArrayList<Future<BigInteger>>(FACTORIALS);
            for (int i = 0; i < FACTORIALS; i++) {
                futures.add(executor.submit(() -> {
                    var result = BigInteger.valueOf(1);
                    for (int j = 1; j <= 2048; j++) {
                        result = result.multiply(BigInteger.valueOf(j));
                    }
                    return result;
                }));
            }
            var result = new ArrayList<BigInteger>(FACTORIALS);
            for (int i = 0; i < FACTORIALS; i++) {
                result.add(futures.get(i).get());
            }
            return result;
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + Threads2Benchmark.class.getSimpleName() + ".*")
                // From a comment in the GCProfiler internals:
                // the problem is that threads can come and go while performing the benchmark,
                // thus we would miss allocations made in a thread that was created and died between the snapshots.
                // So this profiler reports zeros for our benchmark.
                //.addProfiler(GCProfiler.class)
                .build();
        new Runner(opt).run();
    }
}
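The commented-out profiler line deserves a note: GCProfiler snapshots allocation counters per thread, so it misses threads that are created and die between snapshots, which is exactly what the pool threads here do. If I wanted per-task allocation numbers anyway, HotSpot's com.sun.management.ThreadMXBean could provide them; the following is only my own sketch of that idea (the class name is made up), not part of the benchmark:

import java.lang.management.ManagementFactory;
import java.math.BigInteger;

// Sketch only: relies on the HotSpot-specific com.sun.management.ThreadMXBean,
// so it will not work on JVMs without per-thread allocation counters.
public class PerTaskAllocation {
    public static void main(String[] args) {
        var threadMXBean = (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long threadId = Thread.currentThread().threadId();
        long before = threadMXBean.getThreadAllocatedBytes(threadId);

        var result = BigInteger.ONE;
        for (int i = 1; i <= 2048; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }

        long after = threadMXBean.getThreadAllocatedBytes(threadId);
        // Expected to be in the same ballpark as the ~4.17 MB/op gc.alloc.rate.norm above.
        System.out.println("Allocated " + (after - before) + " bytes for a " + result.bitLength() + "-bit result");
    }
}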
Either way, from the results it is clear that the allocations were not removed by DCE. This benchmark produces the following results:
Benchmark (threads) Mode Cnt Score Error Units
Threads2Benchmark.bench 1 avgt 15 51.556 ± 0.803 ms/op
Threads2Benchmark.bench 2 avgt 15 30.049 ± 1.001 ms/op
Threads2Benchmark.bench 3 avgt 15 22.340 ± 0.407 ms/op
Threads2Benchmark.bench 4 avgt 15 18.544 ± 0.286 ms/op
Threads2Benchmark.bench 5 avgt 15 16.336 ± 0.184 ms/op
Threads2Benchmark.bench 6 avgt 15 15.114 ± 0.184 ms/op
Threads2Benchmark.bench 7 avgt 15 14.417 ± 0.144 ms/op
Threads2Benchmark.bench 8 avgt 15 13.758 ± 0.098 ms/op
Threads2Benchmark.bench 9 avgt 15 12.940 ± 0.072 ms/op
Threads2Benchmark.bench 10 avgt 15 12.500 ± 0.053 ms/op
Threads2Benchmark.bench 11 avgt 15 12.260 ± 0.062 ms/op
Threads2Benchmark.bench 12 avgt 15 12.122 ± 0.052 ms/op
Threads2Benchmark.bench 13 avgt 15 11.931 ± 0.040 ms/op
Threads2Benchmark.bench 14 avgt 15 11.929 ± 0.037 ms/op
Threads2Benchmark.bench 15 avgt 15 11.812 ± 0.064 ms/op
Threads2Benchmark.bench 16 avgt 15 11.836 ± 0.043 ms/op
Threads2Benchmark.bench 17 avgt 15 11.846 ± 0.038 ms/op
Threads2Benchmark.bench 18 avgt 15 11.801 ± 0.033 ms/op
Threads2Benchmark.bench 19 avgt 15 11.831 ± 0.027 ms/op
Threads2Benchmark.bench 20 avgt 15 11.831 ± 0.026 ms/op
Threads2Benchmark.bench 21 avgt 15 11.770 ± 0.067 ms/op
Threads2Benchmark.bench 22 avgt 15 11.813 ± 0.022 ms/op
Threads2Benchmark.bench 23 avgt 15 11.770 ± 0.040 ms/op
Threads2Benchmark.bench 24 avgt 15 11.724 ± 0.037 ms/op
Threads2Benchmark.bench 25 avgt 15 11.730 ± 0.015 ms/op
Threads2Benchmark.bench 26 avgt 15 11.711 ± 0.036 ms/op
Threads2Benchmark.bench 27 avgt 15 11.686 ± 0.024 ms/op
Threads2Benchmark.bench 28 avgt 15 11.655 ± 0.029 ms/op
Threads2Benchmark.bench 29 avgt 15 11.664 ± 0.014 ms/op
Threads2Benchmark.bench 30 avgt 15 11.672 ± 0.081 ms/op
Threads2Benchmark.bench 31 avgt 15 11.650 ± 0.018 ms/op
Threads2Benchmark.bench 32 avgt 15 11.674 ± 0.012 ms/op
Threads2Benchmark.bench 33 avgt 15 11.680 ± 0.034 ms/op
Threads2Benchmark.bench 34 avgt 15 11.697 ± 0.040 ms/op
Threads2Benchmark.bench 35 avgt 15 11.731 ± 0.020 ms/op
Threads2Benchmark.bench 36 avgt 15 11.721 ± 0.038 ms/op
The biggest gain comes from giving this heavily allocation-bound work a second thread; the two channels of RAM probably have something to do with that. However, adding even more threads still brings an additional, if smaller, speedup.
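To put that in numbers, here is a quick calculation of the speedup relative to the single-threaded run (my own snippet, with the ms/op values copied from the table above):

// Speedup relative to the 1-thread baseline, using ms/op values from the table above.
public class SpeedupCheck {
    public static void main(String[] args) {
        double baseline = 51.556; // ms/op with 1 thread
        int[] threads = {2, 4, 8, 16, 32};
        double[] msPerOp = {30.049, 18.544, 13.758, 11.836, 11.674};
        for (int i = 0; i < threads.length; i++) {
            // Prints roughly 1.7x, 2.8x, 3.7x, 4.4x, 4.4x.
            System.out.printf("%2d threads: %.1fx%n", threads[i], baseline / msPerOp[i]);
        }
    }
}

The improvements get smaller once the 8 performance cores are saturated and the curve is essentially flat beyond roughly 16 threads, but it never turns downward.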
The general conclusion I can draw is that we should not worry about the efficiency cores: even when they do not help, they do not make things worse, and in some cases they do help. So the usual rule of thumb stays as it was: use all the cores/threads of the CPU, and deviate from that only in some really specific cases, just as we previously turned off hyper-threading only in specific cases.