
Goroutines Are Cheap, but Not Free


Concurrency Is Not Parallelism1 #

Recently two interesting observations in “A Study of Real-World Data Races in Golang”2 caught my eye:

Observation 1. Developers using Go employ significantly more concurrency and synchronization constructs than in Java.

Observation 2. Developers using Go for programming microservices expose significantly more runtime concurrency than other languages such as Java, Python, and NodeJS used for the same purpose.

These observations agree with my experience, and while easy concurrency is great, it comes with its own problems.3

If you are familiar with the intricacies of the Go scheduler, jump to part four; otherwise, let’s take a step back:

Building Blocks #

Concurrency in Go builds upon

  • Goroutines
  • Channels
  • Cancelation (in package context)
  • Function literals (closures, anonymous functions)
  • Synchronization primitives (in package sync)
  • Result and error handling (or more generally communication between goroutines)

And while it can be argued that some of these points are not exactly part of Go’s program execution model, all of these elements are identifiable in code dealing with concurrency.
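
Before diving in, here is a minimal, self-contained sketch (not taken from this post’s repository, just a refresher) that combines most of these building blocks: goroutines running function literals, a channel for communicating results, a sync.WaitGroup, and cancelation via package context:

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func main() {
	// Cancelation via package context: give the whole operation a deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	results := make(chan int) // channel for communication between goroutines

	var wg sync.WaitGroup // synchronization primitive from package sync
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go func(i int) { // goroutine running a function literal (a closure over ctx and results)
			defer wg.Done()
			select {
			case results <- i * i: // hand the result back to the main goroutine
			case <-ctx.Done(): // give up if the work is canceled
			}
		}(i)
	}

	go func() { // close the channel once all workers are done
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Println("result:", r)
	}
}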

A Simple Example #

To start our journey, let us take a simple example: we want to calculate the 27th Fibonacci number (196,418) using a naive recursive approach:

package fibonacci

func Slow(i int) int {
	if i < 2 {
		return i
	}

	fn1 := Slow(i - 1)
	fn2 := Slow(i - 2)

	return fn1 + fn2
}

And call this function 1,000 times:

package main

import "fillmore-labs.com/blog/goroutines/pkg/fibonacci"

func main() {
	for range 1_000 {
		// queryStart := time.Now()
		_ = fibonacci.Slow(27)
		// duration := time.Since(queryStart)
	}
}

This is an easily understandable stand-in for a CPU-bound algorithm, so please do not send me better implementations - it is meant to burn CPU cycles.
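
The timings reported below are produced by a small driver around these calls. The exact measurement code is not shown here; roughly, and with hypothetical output formatting, it could look like this (using the per-call timing that is commented out above):

package main

import (
	"fmt"
	"math"
	"time"

	"fillmore-labs.com/blog/goroutines/pkg/fibonacci"
)

func main() {
	const runs = 1_000

	durations := make([]time.Duration, 0, runs)
	total := time.Duration(0)

	for range runs {
		queryStart := time.Now()
		_ = fibonacci.Slow(27)
		duration := time.Since(queryStart)

		durations = append(durations, duration)
		total += duration
	}

	avg := total / runs

	// Sample standard deviation of the per-call durations.
	var sum float64
	for _, d := range durations {
		diff := float64(d - avg)
		sum += diff * diff
	}
	stddev := time.Duration(math.Sqrt(sum / (runs - 1)))

	fmt.Printf("*** Finished %d runs in %v - avg %v, stddev %v\n", runs, total, avg, stddev)
}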

Running this on an N5105 CPU gives us:

> go run fillmore-labs.com/blog/goroutines/cmd/try1
*** Finished 1000 runs in 1.47s - avg 1.47ms, stddev 18.4µs

So, our whole program takes 1.47 seconds on a single core. This is okay, but since the N5105 has four cores, we can do better.

Let’s parallelize (yes, I know) this:

package fibonacci

func Parallel1(i int) int {
	if i < 2 {
		return i
	}

	fc1 := make(chan int)
	go func() { fc1 <- Parallel1(i - 1) }()

	fc2 := make(chan int)
	go func() { fc2 <- Parallel1(i - 2) }()

	return <-fc1 + <-fc2
}

Ok, great. Off we go:

> go run fillmore-labs.com/blog/goroutines/cmd/try2
*** Finished 1000 runs in 5m25s - avg 325ms, stddev 6.49ms

This is pretty terrible. It takes more than 200 times as long as before (325ms versus 1.47ms per run) while using all available cores.

One problem is easy to spot: we create two new goroutines and use the original one just to wait rather than to do any meaningful work. Let us fix that:

package fibonacci

func Parallel2(i int) int {
	if i < 2 {
		return i
	}

	fc1 := make(chan int)
	go func() { fc1 <- Parallel2(i - 1) }()

	fn2 := Parallel2(i - 2)

	return <-fc1 + fn2
}

Another try, then:

> go run fillmore-labs.com/blog/goroutines/cmd/try3
*** Finished 1000 runs in 1m51s - avg 111ms, stddev 3.10ms

Ok, that was an easy threefold speedup. Much better, but we are still much slower than the single-core solution.
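
The remaining gap comes from goroutine creation and channel overhead swamping the tiny subproblems near the bottom of the recursion. Just to illustrate where the cost goes (this variant is not from the post, and the cutoff value is an arbitrary guess), one could stop spawning goroutines below a threshold and fall back to the sequential version:

package fibonacci

// cutoff is an arbitrary threshold below which spawning a goroutine
// costs more than the remaining work is worth.
const cutoff = 15

func Parallel3(i int) int {
	if i < cutoff {
		return Slow(i) // small subproblems stay on the current goroutine
	}

	fc1 := make(chan int)
	go func() { fc1 <- Parallel3(i - 1) }()

	fn2 := Parallel3(i - 2)

	return <-fc1 + fn2
}

With a cutoff like this, each call spawns on the order of a few hundred goroutines instead of hundreds of thousands, which makes the difference between paying mostly for scheduling and actually using the cores for work.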

No One Writes Code Like That #

Oh? containerd has a ‘concurrent’ garbage collector that spawns exponentially many goroutines:

package gc

import (
	"sync"
)

func ConcurrentMark(root Node) map[Node]struct{} {
	grays := make(chan Node)
	seen := map[Node]struct{}{}

	var wg sync.WaitGroup
	go func() {
		for gray := range grays {
			if _, ok := seen[gray]; ok {
				wg.Done()

				continue
			}
			seen[gray] = struct{}{}

			go func() {
				defer wg.Done()

				var children []Node = gray.Children()
				for _, n := range children {
					wg.Add(1)
					grays <- n
				}
			}()
		}
	}()

	wg.Add(1)
	grays <- root

	wg.Wait()
	close(grays)

	return seen
}

This is a clever piece of code: access to seen is serialized through the grays channel, the wait group is incremented before each Node is posted to grays, and every previously unseen node spawns a new goroutine that enqueues its children.

Simply constructing a complete four-ary tree of height nine and comparing the concurrent marker with a simple non-concurrent solution gives us the following result:

> go run fillmore-labs.com/blog/goroutines/cmd/gc
Concurrent: Found 349525 reachable nodes in 462ms
Non-Concurrent: Found 349525 reachable nodes in 166ms
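
For reference, the non-concurrent marker used in this comparison is not shown above. A straightforward version (a sketch with a hypothetical name, reusing the Node type and Children method from the snippet and assuming Node is comparable) could look like this:

package gc

// SequentialMark walks the graph from root with an explicit stack,
// recording every reachable node exactly once.
func SequentialMark(root Node) map[Node]struct{} {
	seen := map[Node]struct{}{}
	stack := []Node{root}

	for len(stack) > 0 {
		// Pop the next node to visit.
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]

		if _, ok := seen[n]; ok {
			continue
		}
		seen[n] = struct{}{}

		stack = append(stack, n.Children()...)
	}

	return seen
}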

Should containerd change its solution? No. This code is seven years old and seems to be unused. The constructed test tree is pathological, and the results will differ in practice because many nodes will already have been seen. Also, calculating children might be CPU-intensive.

The point I am making is that this kind of code exists in practice, and while the Go scheduler is good and forgives many mistakes, it is still often not used properly.

Summary #

Use of goroutines can make execution slower.

Especially interesting is the comparison of Parallel1 and Parallel2. While both are bad solutions, I have often seen the pattern of dedicating a goroutine exclusively to waiting instead of doing actual work, and it makes a dramatic difference in this case.

… continued in part two.


  1. Rob Pike. 2012. Concurrency is not Parallelism. Talk at Heroku Waza conference — January 2012 — <vimeo.com/49718712> — <go.dev/talks/2012/waza.slide> ↩︎

  2. Milind Chabbi and Murali Krishna Ramanathan. 2022. A Study of Real-World Data Races in Golang. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation — June 2022 — Pages 474–489 — <doi.org/10.1145/3519939.3523720> ↩︎

  3. Tengfei Tu, Xiaoyu Liu, Linhai Song, Yiying Zhang. 2019. Understanding Real-World Concurrency Bugs in Go. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems — April 2019 — Pages 865–878 — <doi.org/10.1145/3297858.3304069> ↩︎