Resource Leaks


… continued from the previous post.

Problems With Concurrency #

In the first post of this series I cited two papers that examined common bugs in the handling of Go concurrency.

It took us a long time to arrive here, but I wanted to show that goroutines are not the panacea that beginners especially take them for. Understanding concurrent code is sometimes hard, even when it is easy to write.

One thing I see often is leaking goroutines when handling errors:

package main

import (
	"errors"
	"fmt"
	"sync"
)

func MergeContributors(primaryAccount, secondaryAccount Account) error {
	// Create a WaitGroup to manage the goroutines.
	var waitGroup sync.WaitGroup
	c := make(chan error)

	// Perform 3 concurrent transactions against the database.
	waitGroup.Add(3)

	go func() {
		waitGroup.Wait()
		close(c)
	}()

	// Transaction #1, merge "commit" records
	go func() {
		defer waitGroup.Done()

		err := mergeCommits(primaryAccount, secondaryAccount)
		if err != nil {
			c <- err
		}
	}()

	// Transaction #2, merge "pull request" records
	go func() {
		defer waitGroup.Done()

		err := mergePullRequests(primaryAccount, secondaryAccount)
		if err != nil {
			c <- err
		}
	}()

	// Transaction #3, merge "merge" records
	go func() {
		defer waitGroup.Done()

		err := mergePullRequestMerges(primaryAccount, secondaryAccount)
		if err != nil {
			c <- err
		}
	}()

	// This line is bad! Get rid of it!
	// waitGroup.Wait()

	for err := range c {
		if err != nil {
			return err
		}
	}

	return markMerged(primaryAccount, secondaryAccount)
}

Can you spot the leak? Try it on the Go playground.

This code is adapted from “Synchronizing Go Routines with Channels and WaitGroups”. It was not originally written by Sophie DeBenedetto herself, and I believe she is a capable developer; the example was chosen because it is typical of goroutine leaks in practice, not to blame anyone personally.

The original code didn’t exercise the error cases and deadlocked if one transaction failed. A deadlock is comparatively easy to spot, though, and will get fixed quickly.

When two or more requests fail (which I assume is realistic, because whatever caused the first failure might well affect the other requests too), the first error makes MergeContributors return, the second sender hangs on c <- err (because no one will ever read from the error channel again), and the first goroutine started hangs on waitGroup.Wait().

The resulting bug is a memory and goroutine leak, and it is much more intricate and easily overlooked.
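
To make this concrete, here is a minimal sketch for observing the leak, assuming hypothetical stubs in which two of the three transactions fail. It reuses MergeContributors from the listing above in the same package (merge the import lists accordingly):

package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

// Hypothetical stand-ins for the real types and database transactions.
type Account struct{ Name string }

func mergeCommits(a, b Account) error           { return errors.New("merging commits failed") }
func mergePullRequests(a, b Account) error      { return errors.New("merging pull requests failed") }
func mergePullRequestMerges(a, b Account) error { return nil }
func markMerged(a, b Account) error             { return nil }

func main() {
	before := runtime.NumGoroutine()

	err := MergeContributors(Account{"primary"}, Account{"secondary"})
	fmt.Println("returned:", err)

	// The leaked goroutines are stuck, not slow; the short pause only
	// lets the scheduler settle before we count.
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("goroutines: %d before, %d after\n", before, runtime.NumGoroutine())
	// One sender is blocked on `c <- err`, the waiter on `waitGroup.Wait()`,
	// so the count after is permanently higher than before.
}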

Overall, we found that there are around 42% blocking bugs caused by errors in protecting shared memory, and 58% are caused by errors in message passing. Considering that shared memory primitives are used more frequently than message passing ones, message passing operations are even more likely to cause blocking bugs.1

Observation 3. Contrary to the common belief that message passing is less error-prone, more blocking bugs in our studied Go applications are caused by wrong message passing than by wrong shared memory protection.1

Analysis #

What could be considered a potential solution? One approach is a buffered channel for the errors, c := make(chan error, 3) (although 2 would suffice), which resolves the leak and could therefore be called a “fix”.

However, this introduces additional magic numbers into the code, hurting readability and maintainability. It assumes that we know the maximum number of spawned goroutines in advance, so it is more shotgun debugging than a genuine understanding and resolution of the underlying issue.
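
For illustration, the whole “fix” is this single changed line; note how its capacity silently has to track the waitGroup.Add(3) call:

	// Buffered so that late senders never block, even after the main
	// goroutine has stopped reading. The capacity must be kept in sync
	// with waitGroup.Add(3) by hand.
	c := make(chan error, 3)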

What Would Be a Proper Fix? #

Analyzing the code, we can identify five goroutines communicating:

  • The three worker goroutines, performing the transactions
  • The first goroutine, the “waiter”, waiting for the workers to finish
  • The main goroutine (running MergeContributors), waiting for all “child” goroutines to finish and either succeeding or returning the first error that occurred

How do they communicate? The three worker goroutines communicate with the first one, the “waiter”, through a sync.WaitGroup, and all four (waiter and workers) communicate with the main goroutine via the error channel.

This gives us an idea why the original approach of using the WaitGroup in the main goroutine is overly complicated and led to the deadlock that was resolved in the blog post.

The problem we can easily identify is that the main goroutine does not do what we assume, “waiting for all child goroutines to finish and either returning the first error that happened or success”. When an error occurs, it abandons its duties and leaks the remaining goroutines:

	// waitGroup.Wait()

	for err := range c {
		if err != nil {
			return err
		}
	}

So, let’s fix that:

	var err error
	for e := range c {
		if err == nil {
			err = e
		}
	}
	if err != nil {
		return err
	}

This does what we expect: it uses the channel close as the termination signal and returns the first error once all goroutines have finished.

But This Is Bad Code #

Maybe. It is not tricky and does what the original code probably intended. Let’s make it a little more readable by grouping the relevant code:

	go func() {
		waitGroup.Wait()
		close(c)
	}()

	// … (the worker goroutines in between are unchanged) …

	var err error
	for e := range c {
		if err == nil {
			err = e
		}
	}
	if err != nil {
		return err
	}

What might make the code bad is that it spawns a goroutine that does nothing but wait: it translates a WaitGroup into a channel close.

The backpressure on the result channel is fine; we just don’t like it because we don’t plan to process the remaining data (the errors).
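
For comparison (this is not from the original post): the golang.org/x/sync/errgroup package bundles exactly this machinery, waiting for all goroutines and keeping the first error, so the waiter goroutine and the channel disappear entirely. A minimal sketch, assuming the same hypothetical stubs as above:

import "golang.org/x/sync/errgroup"

func MergeContributors(primaryAccount, secondaryAccount Account) error {
	// The zero value of errgroup.Group is ready to use. Go runs each
	// function in its own goroutine; Wait blocks until all of them have
	// returned and yields the first non-nil error.
	var group errgroup.Group

	group.Go(func() error { return mergeCommits(primaryAccount, secondaryAccount) })
	group.Go(func() error { return mergePullRequests(primaryAccount, secondaryAccount) })
	group.Go(func() error { return mergePullRequestMerges(primaryAccount, secondaryAccount) })

	if err := group.Wait(); err != nil {
		return err
	}

	return markMerged(primaryAccount, secondaryAccount)
}

Late errors are simply discarded by the group instead of backing up in a channel, which matches the intent stated above: wait for everything, return the first failure.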

Summary #

Naïve use of goroutines and synchronization primitives can introduce bugs, while not necessarily improving execution efficiency.

… continued in the next post.


  1. Tengfei Tu, Xiaoyu Liu, Linhai Song, and Yiying Zhang. 2019. Understanding Real-World Concurrency Bugs in Go. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19), April 2019, 865–878. https://doi.org/10.1145/3297858.3304069 ↩︎ ↩︎