Recently a colleague suggested that contention on a single sync.Pool in Go may hurt performance. I decided to try to reproduce this performance hit in a benchmark. Here is what I came up with.

package pool

import (
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
)

// workWithPool gets 1000 values from the pool and then puts them all back,
// generating a burst of Get/Put traffic on every call.
func workWithPool(pool *sync.Pool) {
	const size = 1000
	values := [size]int{}
	for i := 0; i < size; i++ {
		values[i] = pool.Get().(int)
	}
	for i := 0; i < size; i++ {
		pool.Put(values[i])
	}
}

func Benchmark_Pool(b *testing.B) {
	for _, parallelism := range []int{1, 2, 3, 4, 5, 10, 20, 50, 100} {
		b.Run("Parallelism "+strconv.Itoa(parallelism), func(b *testing.B) {

			// one pool: all goroutines share a single sync.Pool.
			b.Run("one pool", func(b *testing.B) {
				b.SetParallelism(parallelism)
				pool := &sync.Pool{
					New: func() any {
						return 42
					},
				}
				b.RunParallel(func(pb *testing.PB) {
					for pb.Next() {
						workWithPool(pool)
					}
				})
			})
			// many pools: one sync.Pool per goroutine, so Get and Put never contend.
			b.Run("many pools", func(b *testing.B) {
				b.SetParallelism(parallelism)
				// RunParallel starts GOMAXPROCS*parallelism goroutines, one pool for each.
				numPools := runtime.GOMAXPROCS(0) * parallelism
				pools := make([]*sync.Pool, numPools)
				for i := 0; i < numPools; i++ {
					pools[i] = &sync.Pool{
						New: func() any {
							return 42
						},
					}
				}

				// Hand each goroutine its own pool by atomically claiming the next index.
				procs := uint32(0)
				b.RunParallel(func(pb *testing.PB) {
					poolidx := atomic.AddUint32(&procs, 1) - 1
					for pb.Next() {
						workWithPool(pools[poolidx])
					}
				})
			})
		})
	}
}
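
To reproduce, the usual go test invocation should do the job, something along these lines (the exact flags do not matter much):

go test -bench=Benchmark_Pool -run='^$'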

It seems that I need to comment a bit on the parallelism parameter. b.SetParallelism(p) tells b.RunParallel to start p goroutines per GOMAXPROCS, and GOMAXPROCS by default equals the number of CPU cores. My machine has 10 cores, so, for example, parallelism 20 means 200 goroutines in total. A quick way to verify the goroutine count is sketched below.
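
This is not part of the benchmark above, just a small sanity check I would use to confirm how many goroutines RunParallel actually starts (the name Benchmark_CountGoroutines is made up; run with -v to see the log line):

func Benchmark_CountGoroutines(b *testing.B) {
	var started int32
	b.SetParallelism(20)
	b.RunParallel(func(pb *testing.PB) {
		// The body function is invoked once per goroutine.
		atomic.AddInt32(&started, 1)
		for pb.Next() {
		}
	})
	// With 10 cores and parallelism 20 this should report 200 goroutines.
	b.Logf("goroutines: %d, GOMAXPROCS: %d", started, runtime.GOMAXPROCS(0))
}

Here are the results of the main benchmark, with some omissions and reformatting to simplify reading.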

Benchmark_Pool/Parallelism_1/one_pool         	  	      4513 ns/op
Benchmark_Pool/Parallelism_1/many_pools       	  	      2364 ns/op

Benchmark_Pool/Parallelism_2/one_pool         	  	      3078 ns/op
Benchmark_Pool/Parallelism_2/many_pools       	  	      2273 ns/op

Benchmark_Pool/Parallelism_3/one_pool         	  	      3144 ns/op
Benchmark_Pool/Parallelism_3/many_pools       	  	      2272 ns/op

Benchmark_Pool/Parallelism_4/one_pool         	  	      2979 ns/op
Benchmark_Pool/Parallelism_4/many_pools       	  	      2315 ns/op

Benchmark_Pool/Parallelism_5/one_pool         	  	      3355 ns/op
Benchmark_Pool/Parallelism_5/many_pools       	  	      2314 ns/op

Benchmark_Pool/Parallelism_10/one_pool        	  	      4948 ns/op
Benchmark_Pool/Parallelism_10/many_pools      	  	      2269 ns/op

Benchmark_Pool/Parallelism_20/one_pool        	  	      6979 ns/op
Benchmark_Pool/Parallelism_20/many_pools      	  	      2288 ns/op

Benchmark_Pool/Parallelism_50/one_pool        	  	      8324 ns/op
Benchmark_Pool/Parallelism_50/many_pools      	  	      2371 ns/op

Benchmark_Pool/Parallelism_100/one_pool       	  	     10044 ns/op
Benchmark_Pool/Parallelism_100/many_pools     	  	      2295 ns/op

It seems that with only 1 goroutine per CPU core the single-pool case for some reason suffers more than with 2-5 goroutines per core. To see a really noticeable penalty, however, we need 10 or more goroutines per core, while the many-pools variant stays flat at roughly 2300 ns/op throughout. Keep in mind also that every benchmark operation calls Get on the sync.Pool 1000 times and Put 1000 times, so even the worst case of 10044 ns/op comes down to about 5 ns per pool operation.

After writing and running this benchmark I conclude that contention on a single sync.Pool is not a real danger, and it can be used even for very common, low-level things.
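
To make "very common, low-level things" a bit more concrete, the kind of usage I have in mind is the classic shared buffer pool. This is just a minimal sketch (bufPool and formatGreeting are made-up names, not part of the benchmark above):

package pool

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// formatGreeting borrows a buffer from the shared pool, uses it, and puts it back.
func formatGreeting(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}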