In the previous post I investigated the effect of old-generation objects on Java garbage collectors. Here I want to run a similar test for Go. It is a bit less interesting since there is only one collector to choose from, but some useful insights can still be drawn from it: how much RAM to give the GC as breathing room, whether it is worth adding a few extra GB of RAM to squeeze out the last few percent of performance, and at what point adding more RAM starts to make things worse (in theory this is easy to achieve, especially if you measure success in terms of latency rather than throughput). Any complex test is better done with a more realistic workload, however; here I only want to check simple assumptions and dependencies.
First, I wrote a simple test to check how a half-occupied heap affects garbage collection and overall performance.
Half heap ballast
package main

import (
	"runtime/debug"
	"testing"

	"golang.org/x/exp/rand"
)

// BBlackhole is a global sink that keeps allocations from being optimized away.
var BBlackhole []byte

func BenchmarkBallast(b *testing.B) {
	debug.SetMemoryLimit(100 * 1024 * 1024) // 100 MB
	// run everything 10 times to catch possible warmup effects
	for bench := 0; bench < 10; bench++ {
		b.Run("NoBallast", func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				BBlackhole = make([]byte, 10*1024) // 10 KB
			}
		})
		// make ~50 MB of ballast in the heap
		olds := make([][]byte, 5)
		for i := 0; i < 5; i++ {
			olds[i] = make([]byte, 10*1024*1024) // 10 MB
		}
		b.Run("Ballast", func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				BBlackhole = make([]byte, 10*1024) // 10 KB
			}
		})
		// "use" the ballast so that it is not collected earlier
		BBlackhole = olds[rand.Intn(5)]
	}
}
The results (repetitions are omitted for brevity, since they are almost identical to each other) are quite disappointing.
BenchmarkBallast/NoBallast
BenchmarkBallast/NoBallast-10 1667966 742.1 ns/op
BenchmarkBallast/Ballast
BenchmarkBallast/Ballast-10 3101397 346.6 ns/op
Without the old-object ballast, performance is almost twice as bad as with it. But that is easily explained by the defaults of the Go GC heuristics. Besides the new and shiny GOMEMLIMIT option there is the old, but still used, GOGC option. Its default value of 100 triggers a garbage collection once the heap has grown to roughly 200% of the live heap left after the previous collection. With no old objects in the heap, that means collections are triggered almost non-stop, one right after another. This behaviour can be disabled with the following line (alternatively, just like the memory limit, it can be set with an environment variable, GOGC=off):
// to turn off automatic collections on a nearly empty heap
debug.SetGCPercent(-1)
With that line added, the results are as follows:
BenchmarkBallast/NoBallast
BenchmarkBallast/NoBallast-10 3771654 306.6 ns/op
BenchmarkBallast/Ballast
BenchmarkBallast/Ballast-10 3588614 348.4 ns/op
As expected, the allocation-bound benchmark gets a throughput boost from more free heap and less frequent garbage collections.
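The GOGC heuristic discussed above can be sketched as a small calculation. This is only an approximation based on the formula in the Go GC guide (target heap = live heap + (live heap + GC roots) * GOGC / 100, rounded up to a documented 4 MiB minimum); the function name `heapTarget` and the constant `minTotalHeap` are made up for this illustration, and the interaction with the memory limit is ignored.

```go
package main

import "fmt"

// minTotalHeap is the minimum total heap size described in the Go GC guide.
// Assumption: the real runtime scales this value with GOGC; this sketch does not.
const minTotalHeap = 4 << 20 // 4 MiB

// heapTarget approximates the heap size at which the next GC cycle is triggered.
// liveHeap and gcRoots are in bytes; gogc is the GOGC percentage (GOGC=off is not modeled).
func heapTarget(liveHeap, gcRoots, gogc uint64) uint64 {
	target := liveHeap + (liveHeap+gcRoots)*gogc/100
	if target < minTotalHeap {
		return minTotalHeap
	}
	return target
}

func main() {
	// ~50 MB of ballast: the next cycle starts only after the heap has doubled.
	fmt.Printf("with ballast: %d MiB\n", heapTarget(50<<20, 0, 100)>>20)

	// Nearly empty heap: the target collapses to the minimum, so only a few
	// hundred 10 KB allocations fit between two collections.
	empty := heapTarget(0, 0, 100)
	fmt.Printf("empty heap:   %d MiB (%d allocations of 10 KB)\n", empty>>20, empty/(10*1024))
}
```

Under these assumptions, a nearly empty heap leaves room for roughly 400 allocations of 10 KB between collections, which would be consistent with the slow NoBallast numbers at the default GOGC.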
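The next test keeps both settings and varies the amount of old objects from 150 MB down to 0 in 10 MB steps:
package main

import (
	"runtime/debug"
	"strconv"
	"testing"

	"golang.org/x/exp/rand"
)

// Blackhole is a global sink that keeps allocations from being optimized away.
var Blackhole []byte

func BenchmarkAllocaWithOlds(b *testing.B) {
	debug.SetMemoryLimit(100 * 1024 * 1024) // 100 MB
	// to turn off automatic collections on a nearly empty heap
	debug.SetGCPercent(-1)
	for bench := 15; bench >= 0; bench-- {
		// allocate bench * 10 MB of old objects before the measurement
		var olds [][]byte
		if bench != 0 {
			olds = make([][]byte, bench)
			for i := 0; i < bench; i++ {
				olds[i] = make([]byte, 10*1024*1024) // 10 MB
			}
		}
		b.Run(strconv.Itoa(bench), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				Blackhole = make([]byte, 10*1024) // 10 KB
			}
			// "use" the old objects so that they stay reachable during the loop
			if bench != 0 {
				Blackhole = olds[rand.Intn(bench)]
			}
		})
		// release the old objects before the next round
		olds = nil
	}
}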
The results are:
BenchmarkAllocaWithOlds
BenchmarkAllocaWithOlds/15
BenchmarkAllocaWithOlds/15-10 611986 2169 ns/op
BenchmarkAllocaWithOlds/14
BenchmarkAllocaWithOlds/14-10 1249198 1013 ns/op
BenchmarkAllocaWithOlds/13
BenchmarkAllocaWithOlds/13-10 1164472 1093 ns/op
BenchmarkAllocaWithOlds/12
BenchmarkAllocaWithOlds/12-10 1000000 1095 ns/op
BenchmarkAllocaWithOlds/11
BenchmarkAllocaWithOlds/11-10 1000000 1131 ns/op
BenchmarkAllocaWithOlds/10
BenchmarkAllocaWithOlds/10-10 1116847 1071 ns/op
BenchmarkAllocaWithOlds/9
BenchmarkAllocaWithOlds/9-10 1073702 1185 ns/op
BenchmarkAllocaWithOlds/8
BenchmarkAllocaWithOlds/8-10 1000000 1080 ns/op
BenchmarkAllocaWithOlds/7
BenchmarkAllocaWithOlds/7-10 2256046 505.5 ns/op
BenchmarkAllocaWithOlds/6
BenchmarkAllocaWithOlds/6-10 3219080 384.3 ns/op
BenchmarkAllocaWithOlds/5
BenchmarkAllocaWithOlds/5-10 3429478 346.5 ns/op
BenchmarkAllocaWithOlds/4
BenchmarkAllocaWithOlds/4-10 3476504 335.5 ns/op
BenchmarkAllocaWithOlds/3
BenchmarkAllocaWithOlds/3-10 3768894 334.3 ns/op
BenchmarkAllocaWithOlds/2
BenchmarkAllocaWithOlds/2-10 3570604 338.5 ns/op
BenchmarkAllocaWithOlds/1
BenchmarkAllocaWithOlds/1-10 3689316 320.5 ns/op
BenchmarkAllocaWithOlds/0
BenchmarkAllocaWithOlds/0-10 3913862 311.2 ns/op
PASS
The results with more than 100 MB of old objects in a heap with a 100 MB soft limit are somewhat interesting. No errors are produced (unlike in Java), but garbage collections essentially run all the time, trying to reclaim RAM, and that hurts performance. With less than 100 MB (or, to be more specific, less than 80 MB) of old objects in the heap, the GC overhead decreases.
As a last experiment, let’s try to occupy the heap not with 1-15 heavy old slices but with 1-15 thousand small ones.
olds = make([][]byte, bench*1024)
for i := 0; i < bench*1024; i++ {
	olds[i] = make([]byte, 10*1024) // 10 KB
}
This is much more realistic (unless we are talking about the memory ballast that was common before GOMEMLIMIT was introduced) and affects performance more.
BenchmarkAllocaWithOlds
BenchmarkAllocaWithOlds/15
BenchmarkAllocaWithOlds/15-10 196195 6209 ns/op
BenchmarkAllocaWithOlds/14
BenchmarkAllocaWithOlds/14-10 260227 4549 ns/op
BenchmarkAllocaWithOlds/13
BenchmarkAllocaWithOlds/13-10 319176 3852 ns/op
BenchmarkAllocaWithOlds/12
BenchmarkAllocaWithOlds/12-10 233354 5015 ns/op
BenchmarkAllocaWithOlds/11
BenchmarkAllocaWithOlds/11-10 258919 6355 ns/op
BenchmarkAllocaWithOlds/10
BenchmarkAllocaWithOlds/10-10 200463 5774 ns/op
BenchmarkAllocaWithOlds/9
BenchmarkAllocaWithOlds/9-10 206736 5483 ns/op
BenchmarkAllocaWithOlds/8
BenchmarkAllocaWithOlds/8-10 1000000 1112 ns/op
BenchmarkAllocaWithOlds/7
BenchmarkAllocaWithOlds/7-10 1908793 641.3 ns/op
BenchmarkAllocaWithOlds/6
BenchmarkAllocaWithOlds/6-10 2411346 494.2 ns/op
BenchmarkAllocaWithOlds/5
BenchmarkAllocaWithOlds/5-10 2926117 442.5 ns/op
BenchmarkAllocaWithOlds/4
BenchmarkAllocaWithOlds/4-10 3119446 423.4 ns/op
BenchmarkAllocaWithOlds/3
BenchmarkAllocaWithOlds/3-10 3158749 354.8 ns/op
BenchmarkAllocaWithOlds/2
BenchmarkAllocaWithOlds/2-10 3334771 347.9 ns/op
BenchmarkAllocaWithOlds/1
BenchmarkAllocaWithOlds/1-10 3824101 318.6 ns/op
BenchmarkAllocaWithOlds/0
BenchmarkAllocaWithOlds/0-10 3775035 320.7 ns/op
PASS
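The two ballast layouts hold the same number of bytes but a very different number of heap objects. Whether the object count is what drives the extra cost was not verified in these benchmarks; the sketch below only shows how the difference could be observed. The names `makeBallast`, `mallocsDuring` and `keep` are hypothetical, and the count comes from the `Mallocs` field of `runtime.MemStats`.

```go
package main

import (
	"fmt"
	"runtime"
)

// keep holds the ballast in a global so that it escapes to the heap.
var keep [][]byte

// makeBallast allocates count chunks of size bytes each and returns them.
func makeBallast(count, size int) [][]byte {
	olds := make([][]byte, count)
	for i := range olds {
		olds[i] = make([]byte, size)
	}
	return olds
}

// mallocsDuring runs f and reports how many heap objects were allocated meanwhile.
func mallocsDuring(f func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	return after.Mallocs - before.Mallocs
}

func main() {
	// The same 10 MB of ballast, first as one slice, then as 1024 small ones.
	big := mallocsDuring(func() { keep = makeBallast(1, 10*1024*1024) })
	small := mallocsDuring(func() { keep = makeBallast(1024, 10*1024) })
	fmt.Println("objects allocated for 1 x 10 MB:   ", big)
	fmt.Println("objects allocated for 1024 x 10 KB:", small)
}
```

Each small chunk is a separate object that the collector has to find and mark through the outer slice, so a larger object count is one plausible contributor to the slower results.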
Some conclusions can be drawn from these simplistic benchmarks, but only basic ones. I suggest altering the configs or code of production applications only after load testing and profiling them under real load.