More Concurrency Bugs
More Real-world Usage #
Datadog has excellent software engineers. Nevertheless, it’s easy to find code with race conditions. Let’s examine a
simplified version of
waitForConfigsFromAD
:
|
|
Side note:1 This code is used in the Agent Check Status CLI. You will not notice resource leaks there.
What Are the Issues? #
Utilizing an atomic Boolean as a “finished” indicator, coupled with a deferred goroutine to set it and subsequently draining a channel seems clever2 to me.
Unfortunately, it has a race condition. When we first test the atomic waiting
(seeing it be true), and then in
parallel exit waitForConfigsFromAD
, spawning the deferred goroutine which tries to drain the channel, we leak the
goroutine at line 38 because no one will ever read from configChan
.
Try it on the Go Playground.
An Alternative Implementation #
Let us try a synchronous approach. The original is a little more complicated, but let’s simply assume we want to collect configurations until we either:
- collected a fixed number.
- encountered a configuration error.
- canceled the passed context, for example in a time out.
Also, we are interested in the list of encountered errors. On cancelation we just return what we have so far, without signaling an error.
Simply put, we need a collector that allows us to wait until it is finished collecting according to the criteria above and ask what it has collected so far. Something like:
type Collector struct {
// ...
}
func NewCollector(discoveryMinInstances int) *Collector {
// ...
}
func (c *Collector) Schedule(configs []Config) {
// ...
}
func (c *Collector) Done() <-chan struct{} {
// ...
}
func (c *Collector) Result() ([]Config, error) {
// ...
}
Given that, we can reimplement waitForConfigsFromAD
:
|
|
Simple, synchronous code - we could even think about removing the scheduler from the component after it is done.
Now everything else falls into place:
|
|
There’s definite room for improvement here, but the key takeaway is that it can just be written down and it is easily testable.
Summary #
We replaced asynchronous code with a race condition with a synchronous, thread safe implementation. Like in previous posts, refactoring and separation of concerns helps structuring our tasks and avoid errors.
-
This blog is not intended to assign blame or irresponsibly publish bugs. We are here to learn. ↩︎
-
“Clear is better than clever”. Rob Pike. 2015. Go Proverbs with Rob Pike — Gopherfest 2015 — <golang.org/doc/effective_go.html#sharing> ↩︎