More Concurrency Bugs
More Real-world Usage
Datadog has excellent software engineers. Nevertheless, it’s easy to find code with race conditions. Let’s examine a
simplified version of
waitForConfigsFromAD
:
|
|
Side note:1 This code is used in the Agent Check Status CLI. You will not notice resource leaks there.
What Are the Issues?
Utilizing an atomic Boolean as a “finished” indicator, coupled with a deferred goroutine to set it and subsequently draining a channel seems clever2 to me.
Unfortunately, it has a race condition. When we first test the atomic waiting
(seeing it be true), and then in
parallel exit waitForConfigsFromAD
, spawning the deferred goroutine which tries to drain the channel, we leak the
goroutine at line 38 because no one will ever read from configChan
.
Try it on the Go Playground.
An Alternative Implementation
Let us try a synchronous approach. The original is a little more complicated, but let’s simply assume we want to collect configurations until we either:
- collected a fixed number.
- encountered a configuration error.
- canceled the passed context, for example in a time out.
Also, we are interested in the list of encountered errors. On cancelation we just return what we have so far, without signaling an error.
Simply put, we need a collector that allows us to wait until it is finished collecting according to the criteria above and ask what it has collected so far. Something like:
type Collector struct {
// ...
}
func NewCollector(discoveryMinInstances int) *Collector {
// ...
}
func (c *Collector) Schedule(configs []Config) {
// ...
}
func (c *Collector) Done() <-chan struct{} {
// ...
}
func (c *Collector) Result() ([]Config, error) {
// ...
}
Given that, we can reimplement waitForConfigsFromAD
:
|
|
Simple, synchronous code - we could even think about removing the scheduler from the component after it is done.
Now everything else falls into place:
|
|
There’s definite room for improvement here, but the key takeaway is that it can just be written down and it is easily testable.
Summary
We replaced asynchronous code with a race condition with a synchronous, thread safe implementation. Like in previous posts, refactoring and separation of concerns helps structuring our tasks and avoid errors.
This blog is not intended to assign blame or irresponsibly publish bugs. We are here to learn. ↩︎
“Clear is better than clever”. Rob Pike. 2015. Go Proverbs with Rob Pike — Gopherfest 2015 — <golang.org/doc/effective_go.html#sharing> ↩︎