Concurrency Bugs

Errors Observed in Real-world Usage #

In “ Problems With Concurrency” I mentioned that I see concurrency issues a lot. Let’s look at something I recently found in production:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57


package main

func task() ([]int, error) {
	// We want to calculate all results in parallel. So we create a wait group with 3 elements. We will call wg.Done()
	// after each goroutine is finished. We also create a mutex to lock the result while writing to it. We also create
	// an error channel to collect errors from the goroutines.
	var wg sync.WaitGroup
	var mu sync.Mutex
	errs := make(chan error)

	// Create result
	result := make([]int ,3)

	// We create a wait group with 3 elements. We will call wg.Done() after each goroutine is finished.
	wg.Add(3)
	// We start all goroutines.
	go setResult(result, 0, &wg, &mu, errs)
	go setResult(result, 1, &wg, &mu, errs)
	go setResult(result, 2, &wg, &mu, errs)
	wg.Wait()

	// Handle errors
	close(errs)
	for err := range errs {
		return nil, err
	}

	return result, nil
}

func setResult(result []int, i int, wg *sync.WaitGroup, mu *sync.Mutex, errs chan<- error) {
	defer wg.Done()
	r, err := calculate(i)
	if err != nil {
		errs <- err

		return
	}
	mu.Lock()
	result[i] = r
	mu.Unlock()
}

func calculate(i int) (int, error) {
	switch i {
	case 0:
		return 1, nil
	case 11:
		return 2, nil
	case 2:
		return 3, nil
	}

	return 0, errTest
}

var errTest = errors.New("test")

Try it on the Go Playground.

This code is pretty similar to the one in “ Problems With Concurrency” before it was modified in the blog. It makes nearly that same mistake and deadlocks on error conditions. Obviously¹, the fix is not to make this function more complex and use errs := make(chan error, 3).

What Are the Issues? #

This code very nicely demonstrates what you get when you start with concurrent code without thinking about the design.

Let first fix it, then analyze its problems. The task is divided into three independent calculations, where failure in any one of them results in the failure of the entire task. The synchronous version would be:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


func task() ([]int, error) {
	const results = 3
	result := make([]int, results)
	var err error

	for i := 0; i < results; i++ {
		result[i], err = calculate(i)
		if err != nil {
			break
		}
	}

	return result, err
}

Where we see immediately that we don’t need the mutex, since the results don’t overlap. Transforming into a parallel version gives us:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


import "golang.org/x/sync/errgroup"

func task() ([]int, error) {
	const results = 3
	result := make([]int, results)
	var g errgroup.Group

	for i := 0; i < results; i++ {
		r := &result[i] // Added for clarity
		g.Go(func() (err error) {
			*r, err = calculate(i)

			return err
		})
	}

	return result, g.Wait()
}

Given that this task appears straightforward, why are there bugs in the production code?

The comments provide valuable insights, beginning with the directive to “calculate … in parallel”, but without considering the task’s purpose and communication methods. Consequently, it establishes … as synchronization points, along with … and …

The code wasn’t initially designed with correctness as the primary concern. Instead, it began as an asynchronous version, employing go as a substitute for subroutine calls, with fixes being implemented reactively as problems arose. Regrettably, this pattern is frequently observed among junior Go developers.

Summary #

In designing asynchronous programs, it is often better to begin with a synchronous, correct version. This initial approach might even suffice in terms of performance, with parallelism introduced by a calling function, such as operating within one of multiple web requests. Additionally, it’s worth noting that concurrency doesn’t always need to be that extremely fine-grained, especially considering that the number of CPUs in a machine is limited.

Moreover, channels and synchronization points should be purposefully integrated, not employed as mere necessities. The need for error checking shouldn’t be retrospectively fixed by a channel with slapped-on synchronization primitives.

As mentioned before, addition of magic numbers reduces maintainability. ↩︎