In the previous post I tried to measure the overhead of highly contended access to sync.Pool in Go. There was some measurable overhead, but it was mild. However, for one goroutine to hand work to another and/or to publish results back, you may also need a Go channel. In this post I’ll try to measure the overhead of contended access to a Go channel.
So I’ve written the following code:
package main

import (
	"runtime"
	"strconv"
	"sync/atomic"
	"testing"
)

const workSize = 1000

// workWithChan pushes workSize items into the channel and then drains
// them. Both operations are non-blocking: the benchmark sizes every
// channel so that it can never fill up, and panics if that assumption
// is ever violated.
func workWithChan(ch chan int) {
	for i := 0; i < workSize; i++ {
		select {
		case ch <- i:
		default:
			panic("chan is full")
		}
	}
	for i := 0; i < workSize; i++ {
		select {
		case <-ch:
		default:
			panic("chan is empty")
		}
	}
}

func Benchmark_Chan(b *testing.B) {
	for _, parallelism := range []int{1, 2, 3, 4, 5, 10, 20, 50, 100} {
		b.Run("Parallelism "+strconv.Itoa(parallelism), func(b *testing.B) {
			// Variant 1: a single channel shared by all goroutines.
			b.Run("one chan", func(b *testing.B) {
				b.SetParallelism(parallelism)
				ch := make(chan int, workSize*parallelism*runtime.GOMAXPROCS(0))
				b.RunParallel(func(pb *testing.PB) {
					for pb.Next() {
						workWithChan(ch)
					}
				})
			})
			// Variant 2: one channel per CPU core; each channel is
			// shared by parallelism goroutines.
			b.Run("chan per core", func(b *testing.B) {
				b.SetParallelism(parallelism)
				numChans := runtime.GOMAXPROCS(0)
				chans := make([]chan int, numChans)
				for i := 0; i < numChans; i++ {
					chans[i] = make(chan int, workSize*parallelism)
				}
				// Hand out a unique sequential id to each goroutine
				// via a CAS loop.
				procs := uint32(0)
				b.RunParallel(func(pb *testing.PB) {
					chanidx := atomic.LoadUint32(&procs)
					for {
						if atomic.CompareAndSwapUint32(&procs, chanidx, chanidx+1) {
							break
						}
						chanidx = atomic.LoadUint32(&procs)
					}
					// Map ids onto channels so that parallelism
					// goroutines end up sharing each channel.
					chanidx = chanidx / uint32(parallelism)
					for pb.Next() {
						workWithChan(chans[chanidx])
					}
				})
			})
			// Variant 3: a dedicated channel for every goroutine.
			b.Run("chan per goroutine", func(b *testing.B) {
				b.SetParallelism(parallelism)
				numChans := runtime.GOMAXPROCS(0) * parallelism
				chans := make([]chan int, numChans)
				for i := 0; i < numChans; i++ {
					chans[i] = make(chan int, workSize)
				}
				procs := uint32(0)
				b.RunParallel(func(pb *testing.PB) {
					chanidx := atomic.LoadUint32(&procs)
					for {
						if atomic.CompareAndSwapUint32(&procs, chanidx, chanidx+1) {
							break
						}
						chanidx = atomic.LoadUint32(&procs)
					}
					for pb.Next() {
						workWithChan(chans[chanidx])
					}
				})
			})
		})
	}
}
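As a side note, the CAS loop that hands out a unique index to each goroutine can be collapsed into a single atomic increment. Here is a minimal sketch of that simplification; the nextIndex helper is my own name, not part of the benchmark:

import "sync/atomic"

// nextIndex hands out unique sequential ids, one per caller. It behaves
// like the CAS loop above: atomic.AddUint32 returns the new (incremented)
// value, so subtracting 1 recovers the value before the increment.
func nextIndex(counter *uint32) uint32 {
	return atomic.AddUint32(counter, 1) - 1
}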
Here we have 3 variants: one channel shared between all the goroutines, one channel per CPU core, and one channel per goroutine. The additional parameter parallelism is the number of goroutines per CPU core.
Here are the results, with some omissions and reformatting for readability.
Benchmark_Chan/Parallelism_1/one_chan              60868 ns/op
Benchmark_Chan/Parallelism_1/chan_per_core          3594 ns/op
Benchmark_Chan/Parallelism_1/chan_per_goroutine     3584 ns/op
Benchmark_Chan/Parallelism_2/one_chan              61708 ns/op
Benchmark_Chan/Parallelism_2/chan_per_core          4758 ns/op
Benchmark_Chan/Parallelism_2/chan_per_goroutine     3328 ns/op
Benchmark_Chan/Parallelism_3/one_chan              62023 ns/op
Benchmark_Chan/Parallelism_3/chan_per_core          5119 ns/op
Benchmark_Chan/Parallelism_3/chan_per_goroutine     3243 ns/op
Benchmark_Chan/Parallelism_4/one_chan              63873 ns/op
Benchmark_Chan/Parallelism_4/chan_per_core          5343 ns/op
Benchmark_Chan/Parallelism_4/chan_per_goroutine     3375 ns/op
Benchmark_Chan/Parallelism_5/one_chan              63416 ns/op
Benchmark_Chan/Parallelism_5/chan_per_core          5749 ns/op
Benchmark_Chan/Parallelism_5/chan_per_goroutine     3319 ns/op
Benchmark_Chan/Parallelism_10/one_chan             63559 ns/op
Benchmark_Chan/Parallelism_10/chan_per_core         6549 ns/op
Benchmark_Chan/Parallelism_10/chan_per_goroutine    3243 ns/op
Benchmark_Chan/Parallelism_20/one_chan             64031 ns/op
Benchmark_Chan/Parallelism_20/chan_per_core         7948 ns/op
Benchmark_Chan/Parallelism_20/chan_per_goroutine    3230 ns/op
Benchmark_Chan/Parallelism_50/one_chan             64367 ns/op
Benchmark_Chan/Parallelism_50/chan_per_core        14881 ns/op
Benchmark_Chan/Parallelism_50/chan_per_goroutine    3285 ns/op
Benchmark_Chan/Parallelism_100/one_chan            64066 ns/op
Benchmark_Chan/Parallelism_100/chan_per_core       30397 ns/op
Benchmark_Chan/Parallelism_100/chan_per_goroutine   3273 ns/op
So with a single channel shared by all goroutines we immediately get enormous contention: the runs are roughly 17-19x slower than the uncontended variants, and the overhead stays nearly constant no matter how many goroutines run per CPU core.
With a separate channel per goroutine there is, unsurprisingly, no contention and no measurable overhead.
With a channel per CPU core the overhead grows with the number of goroutines launched per core: once parallelism exceeds 1, several goroutines share each channel, and the contention increases accordingly.
Combined with the previous post, my tentative conclusion is that contention on a single sync.Pool is unlikely to become a bottleneck in a real-world application, while contention on a single shared channel may well cause noticeable degradation.
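If a single shared channel does turn out to be a hotspot, the per-core numbers above hint at an obvious mitigation: shard it. Below is a minimal sketch of that idea, assuming round-robin dispatch; the shardedChan type and its methods are my own illustration, not code from the benchmark:

import "sync/atomic"

// shardedChan spreads sends over several buffered channels so that no
// single channel is touched by every producer (hypothetical illustration).
type shardedChan struct {
	shards []chan int
	next   uint32
}

func newShardedChan(numShards, buf int) *shardedChan {
	s := &shardedChan{shards: make([]chan int, numShards)}
	for i := range s.shards {
		s.shards[i] = make(chan int, buf)
	}
	return s
}

// send picks a shard round-robin via an atomic counter; producers now
// contend on a cheap atomic add instead of on one channel's lock.
func (s *shardedChan) send(v int) {
	i := atomic.AddUint32(&s.next, 1) % uint32(len(s.shards))
	s.shards[i] <- v
}

With one consumer goroutine draining each shard, every channel sees far fewer contenders than a single global one would.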