Should I take care of race conditions which almost certainly have no chance of occurring?

Question

Let's consider something like a GUI application where the main thread updates the UI almost instantaneously, and some other thread is polling data over the network, or doing something else that is guaranteed to take 5-10 seconds to finish.

I've received many different answers to this. Some people say that if a race condition is a statistical impossibility, you shouldn't worry about it at all. Others say that if there's even a 10^-53% chance (I kid you not on the numbers, this is what I've heard) of some voodoo magic happening due to a race condition, you should always obtain and release locks on the thread that needs them.

What are your thoughts? Is it good programming practice to handle race conditions in such statistically impossible situations? Or would it be totally unnecessary, or even counterproductive, to add more lines of code that hinder readability?

score 137 · Accepted Answer · answered Aug 17 '12 at 03:47

If it is truly a 1 in 10^55 event, there would be no need to code for it. That would imply that if you did the operation 1 million times a second, you'd get one bug every 3 * 10^41 years which is, roughly, 10^31 times the age of the universe. If your application has an error only once in every trillion trillion billion ages of the universe, that's probably reliable enough.
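The arithmetic behind those figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the age of the universe is taken as roughly 1.38 * 10^10 years, which is an assumption not stated in the answer.

```python
# Back-of-the-envelope check of the figures quoted in the answer.
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # about 3.16e7 seconds
AGE_OF_UNIVERSE_YEARS = 1.38e10         # assumed value, not from the answer


def years_between_failures(probability: float, ops_per_second: float) -> float:
    """Expected years between failures for a given per-operation probability."""
    expected_ops = 1.0 / probability
    return expected_ops / ops_per_second / SECONDS_PER_YEAR


years = years_between_failures(1e-55, 1e6)
print(f"{years:.1e} years")                                         # 3.2e+41 years
print(f"{years / AGE_OF_UNIVERSE_YEARS:.1e} ages of the universe")  # 2.3e+31 ages of the universe
```

Both results agree with the "3 * 10^41 years" and "roughly 10^31 times the age of the universe" stated above.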

However, I would wager very heavily that the error is nowhere near that unlikely. If you can conceive of the error, it is almost certain that it will occur at least occasionally, which makes it worth coding correctly to begin with. Plus, if you code the threads correctly at the outset so that they obtain and release locks appropriately, the code is much more maintainable in the future. When you make a change, you don't have to re-analyze all the potential race conditions, re-compute their probabilities, and assure yourself that they won't occur.
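To give a sense of how little code "obtain and release locks appropriately" can mean for the scenario in the question, here is a minimal Python sketch. The class and function names (`LatestResult`, `poll_once`) are hypothetical, and a real GUI toolkit would have its own rules about which thread may touch widgets.

```python
import threading


class LatestResult:
    """Holds the most recent polling result, shared between two threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = 0

    def publish(self, value):
        # Called by the polling thread when a slow request completes.
        with self._lock:
            self._value = value
            self._version += 1

    def snapshot(self):
        # Called by the UI thread; version and value are read together.
        with self._lock:
            return self._version, self._value


def poll_once(shared, fetch):
    """Worker body: fetch once and publish the result."""
    shared.publish(fetch())


shared = LatestResult()
worker = threading.Thread(target=poll_once, args=(shared, lambda: {"status": "ok"}))
worker.start()
worker.join()
print(shared.snapshot())  # (1, {'status': 'ok'})
```

Because the lock is held only while the two fields are copied, the UI thread never waits on the network call itself.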

score 70 · Answer 2 · answered Aug 17 '12 at 03:50

From the cost-benefit standpoint, you should write additional code only when it gets you enough benefit.

For example, if the worst thing that would happen if a wrong thread "wins the race" is that the information would not display, and the user would need to click "refresh", don't bother guarding against the race condition: writing a lot of code is not worth it to fix something that insignificant.

On the other hand, if the race condition could result in incorrect money transfers between bank accounts, then you must guard against the race condition no matter how much code you need to write to solve this problem.

score 45 · Answer 3 · answered Aug 17 '12 at 04:37

Finding a race condition is the hard part. You probably spent almost as much time writing this question as it would have taken you to fix it. It's not like it makes it that much less readable. Programmers expect to see synchronization code in such situations, and actually might waste more time wondering why it's not there and if adding it would fix their unrelated bug.

As far as probabilities are concerned, you would be surprised. I had a race condition bug report last year that I couldn't reproduce with thousands of automated tries, but one system of one customer saw it all the time. The business value of spending 5 minutes to fix it now, versus possibly troubleshooting an "impossible" bug at a customer's installation, makes the choice a no-brainer.

score 27 · Answer 4 · answered Aug 17 '12 at 03:53

Obtain and release the locks. Probabilities change, algorithms change. Skipping the locks is a bad habit to get into, and with them in place you don't have to stop and wonder whether you got the odds wrong when something goes wrong...

jmoreno

score 13 · Answer 5 · answered Aug 17 '12 at 13:12

"and some other thread is polling data over the network or something that is guaranteed to take 5-10 seconds to finish the job."

Until someone introduces a caching layer to improve performance. Suddenly that other thread finishes nearly instantaneously, and the race condition manifests more often than not.

Had exactly this happen a few weeks ago, took about 2 full developer days to find the bug.

Always fix race conditions if you recognize them.
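One way to make the code indifferent to how long the worker takes is to hand the result over through a thread-safe queue. The sketch below is illustrative only: `fetch_slow`, `fetch_cached` and `load` are hypothetical names, and the 0.2 second sleep stands in for the 5-10 second network call.

```python
import queue
import threading
import time


def fetch_slow():
    time.sleep(0.2)        # stands in for the slow network call
    return "payload"


def fetch_cached():
    return "payload"       # same data, but back almost instantly


def load(fetch):
    """Start the worker, finish UI set-up, then consume the result."""
    results = queue.Queue()
    worker = threading.Thread(target=lambda: results.put(fetch()))
    worker.start()
    # ... UI set-up that the old code assumed would always finish first ...
    value = results.get()  # blocks until the worker has delivered
    worker.join()
    return value


print(load(fetch_slow), load(fetch_cached))  # payload payload
```

The consumer gets the same result whether the worker finishes long after the set-up or long before it, so adding a cache later cannot change the behaviour.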

score 8 · Answer 6 · answered Aug 17 '12 at 04:46

Simple vs correct.

In many cases, simplicity trumps correctness. It's a cost issue.

Also, race conditions are nasty things that tend not to obey simple statistics. Everything goes fine until some other seemingly unrelated synchronization causes your race condition to suddenly happen half the time. Unless you turn the logs on or debug the code of course.

A pragmatic alternative to preventing a race condition (which can be tricky) can be to detect and log it (bonus for failing hard and early). If it never happens, you lost little. If it does actually happen, you got a solid justification to spend the extra time fixing it.
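A detect-and-log guard of this kind might look like the following Python sketch. `ExclusiveSection` is a hypothetical helper, not a standard library class; it uses a non-blocking lock acquire to notice that two threads are inside the same section at once, logs the fact, and fails hard.

```python
import logging
import threading

log = logging.getLogger("race-detector")


class ExclusiveSection:
    """Detects overlapping entry into a section instead of preventing it."""

    def __init__(self, name):
        self._name = name
        self._busy = threading.Lock()

    def __enter__(self):
        # Non-blocking acquire: failure means another thread is already inside.
        if not self._busy.acquire(blocking=False):
            log.error("race detected in %s", self._name)
            raise RuntimeError(f"concurrent entry into {self._name}")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._busy.release()
        return False  # never swallow exceptions raised inside the section


update_section = ExclusiveSection("update_model")
with update_section:
    pass  # code that is assumed never to run in two threads at once
```

If the error never shows up in the logs, little was lost. If it does, the log entry names the section where the timing assumption broke.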

score 7 · Answer 7 · answered Aug 17 '12 at 11:30

If your race condition is security-related, you should always code to prevent it.

A common example is the race condition when creating or opening files on Unix, which can in some circumstances lead to privilege escalation attacks if the program with the race condition is running with higher privileges than the user interacting with it, such as a system daemon process or, worse still, the kernel.

Even if a race condition has something like 10^(-80) chance of happening randomly, it may well be the case that a determined attacker has a decent chance of creating such conditions deliberately and artificially.
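For the file-creation case, the usual defence is to let the operating system do the existence check and the creation as one atomic step. A minimal Python sketch, assuming a hypothetical helper `create_exclusive` and an illustrative file name:

```python
import os
import tempfile


def create_exclusive(path: str, data: bytes) -> bool:
    """Create path only if it does not exist, as a single atomic step.

    The racy alternative is `if not os.path.exists(path): open(path, "wb")`,
    which leaves a window between the check and the creation.
    Returns False if the file already exists.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    try:
        fd = os.open(path, flags, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "daemon.lock")
    print(create_exclusive(target, b"first"))   # True
    print(create_exclusive(target, b"second"))  # False
```

With `O_CREAT | O_EXCL` the call fails if anything already exists at that path, so an attacker cannot slip a file or symlink in between a check and the open.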

score 6 · Answer 8 · answered Aug 21 '12 at 03:17

Therac-25!

Developers on the Therac-25 project were pretty confident about the timing between the UI and an interface-related issue in a therapeutic X-ray machine.

They should not have been.

You can learn more about this famous life-and-death software disaster at:

http://www.youtube.com/watch?v=izGSOsAGIVQ

or

http://en.wikipedia.org/wiki/Therac-25

Your application may be much less sensitive to failure than medical devices. A helpful method is to rate risk exposure as the product of the likelihood of occurrence and the cost of occurrence over the life of the product for all the units that could be produced.

If you have chosen to build your code to last (and it sounds like you have), you should consider Moore's law, which can easily lop off several zeros every few years as computers inside or outside your system get faster. If you ship thousands of copies, lop off more zeros. If users do this operation daily (or monthly) for years, take away a few more. If it is used where Google Fiber is available, what then? If the UI garbage collects in the middle of a GUI operation, does that affect the race? Are you using an open source or Windows library behind your GUI? Can updates there affect timing?

Semaphores, locks, mutexes, and barrier synchronization are among the ways to synchronize activities between threads. If you are not using them, another person who maintains your program might, and then assumptions about the relationships between threads can shift pretty quickly and invalidate the calculation about the race condition.

I recommend that you explicitly synchronize because, while you might never see it create a problem, a customer might. In addition, even if your race condition never occurs, what if you or your organization are called to court to defend your code (as Toyota was over the Prius a few years ago)? The more thorough your methodology, the better you will fare. It might be nicer to say "we guard against this unlikely case like this..." than to say, "we know our code will fail, but we wrote down this equation to show it won't happen in our lifetime. Probably."
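The risk exposure rating described above can be written out as a small calculation. Everything in this Python sketch is hypothetical: the function name, the parameter names, and especially the numbers, which are only there to show how shipping many units and running for years "lops off zeros".

```python
def risk_exposure(p_per_operation, ops_per_unit_per_year, units, years, cost_per_failure):
    """Likelihood of occurrence times cost of occurrence over the product's life."""
    expected_failures = p_per_operation * ops_per_unit_per_year * units * years
    return expected_failures * cost_per_failure


# Made-up inputs: a one-in-a-billion race, hit once a day, on 10,000 units
# over 10 years, with each failure costing 50,000 in some currency.
exposure = risk_exposure(1e-9, 365, 10_000, 10, 50_000)
print(f"{exposure:.0f}")  # 1825
```

A probability that sounds negligible per operation is multiplied by 36.5 million operations here, which is the point the answer makes about fleets and product lifetimes.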

It sounds like the probability calculation comes from someone else. Do they know your code and do you know them enough to trust that no error was made? If I calculated a 99.99997% reliability for something, I might also think back to my college statistics classes and remember that I did not always get 100%, and back off quite a few percent on my own personal reliability estimates.

score 4 · Answer 9 · answered Aug 17 '12 at 14:26

"would it be totally unnecessary or even counterproductive to add more lines of code to hinder readability?"

Simplicity is only good when it's also correct. Since this code is not correct, future programmers will inevitably look at it when looking for a related bug.

Whichever way you handle it (either by logging it, documenting it, or adding the locks -- this depends on the cost), you will save other programmers time when looking at the code.

score 3 · Answer 10 · answered Aug 17 '12 at 03:54

This would depend on the context. If it's a casual iPhone game, probably not. If it's the flight control system for the next manned space vehicle, probably. It all depends on what the consequences are if the 'bad' result happens, measured against the estimated cost of fixing it.

There is rarely a 'one size fits all' answer for these types of questions because they are not programming questions, but instead economics questions.

score 3 · Answer 11 · answered Aug 17 '12 at 15:09

Yes, expect the unexpected. I have spent hours (in other people's code ^^) tracking down conditions that should never happen.

That means habits such as: always have an else, always have a default case in a switch, initialize variables (yes, really, bugs happen from this), check your loops for variables reused across iterations, etc.

If you are worried about threading issues specifically, read blogs, articles, and books on the subject. The current theme seems to be immutable data.
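As a small illustration of the immutable-data idea in Python: if the worker thread hands over frozen snapshots rather than objects it keeps modifying, the reader cannot observe a half-updated value. The `PollResult` class is a hypothetical example; note that `frozen=True` only blocks attribute assignment, so the fields themselves should also be immutable types such as tuples.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class PollResult:
    source: str
    rows: tuple  # a tuple, so the contents cannot be changed in place either


first = PollResult("inventory", ("a", "b"))
# An "update" builds a new snapshot; the old one stays valid for its readers.
second = replace(first, rows=first.rows + ("c",))

try:
    first.rows = ()
except FrozenInstanceError:
    print("snapshots cannot be modified in place")
```

Publishing a new snapshot is then a single reference swap, which is much easier to reason about than locking every field access.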

score 3 · Answer 12 · answered Aug 17 '12 at 15:48

Just fix it.

I've seen exactly this. One thread manages to make a network request to a server, which does a complex database lookup and responds, before the other thread has got to the next line of code. It happens.

Some customer somewhere will decide one day to run something that hogs all the CPU time for the "fast" thread while leaving the slow thread running, and you'll be sorry :)

score 1 · Answer 13 · answered Aug 17 '12 at 19:41 · edited Aug 22 '12 at 07:01

If you've recognised an unlikely race condition, at least document it in the code!

EDIT: I should add that I'd fix it if at all possible, but at the time of writing the above, no other answer explicitly said to at least document the problem in the code.

Mark Hurd

score 0 · Answer 14 · answered Aug 17 '12 at 13:07

I think that if you already know how and why it could happen, you might as well deal with it. That is, if it doesn't take up a copious amount of resources.

Sjaak van der Heide

score 0 · Answer 15 · edited Aug 18 '12 at 00:34

It all depends on what the consequences of a race condition are. I think the people answering your question are correct for their line of work. Mine is router configuration engines. For me, race conditions leave systems frozen, corrupt, or unconfigured even though the operation reported success. I always use a semaphore per router so that I don't have to clean anything up by hand.

I think some of my GUI code is still prone to race conditions, in such a way that a user might be shown an error because a race condition happened, but I would not allow any such possibility where there is a chance of data corruption or misbehaviour of the application afterwards.
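A semaphore-per-router arrangement could be sketched in Python roughly as follows. This is an assumption about the general shape, not the author's actual engine: `RouterLocks`, `for_router` and `configure` are hypothetical names.

```python
import threading


class RouterLocks:
    """Hands out one binary semaphore per router id."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the dictionary itself
        self._semaphores = {}

    def for_router(self, router_id):
        with self._guard:
            sem = self._semaphores.get(router_id)
            if sem is None:
                sem = threading.BoundedSemaphore(1)
                self._semaphores[router_id] = sem
            return sem


locks = RouterLocks()


def configure(router_id, apply_change):
    # Only one configuration run per router at a time; other routers proceed.
    with locks.for_router(router_id):
        return apply_change()


print(configure("edge-1", lambda: "configured"))  # configured
```

Keying the semaphore by router keeps unrelated routers from blocking each other while still serializing changes to any single device.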

score 0 · Answer 16 · answered Aug 18 '12 at 11:01

Funnily enough, I encountered this problem recently. I didn't even realise a race condition was possible in my circumstance. The race condition only presented itself when multi-core processors became the norm.

The scenario was roughly like this. A device driver raised events for the software to handle. Control had to return to the device driver as soon as possible to prevent a timeout on the device. To ensure this, the event was recorded and queued in a separate thread.

Receive event from device:
{
    Record event details.
    Enqueue event in the queuing thread.
    Acknowledge the event.
}

Queueing thread receives an event:
{
    Retrieve event details.
    Process event.
    Send next command to device.
}

This worked fine for years. Then suddenly it would fail in certain configurations. It turns out that the queueing thread was now running truly in parallel to the event handling thread, rather than sharing a single processor's time. It managed to send the next command to the device before the event had been acknowledged, causing an out-of-sequence error.

Given it only affected one customer in one configuration, I shamefully put a Thread.Sleep(1000) in where the problem was. There's not been a problem since.
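For comparison, an explicit ordering guarantee would remove the dependence on timing altogether. The Python sketch below is not the author's code (the original appears to be .NET); it models the scenario with hypothetical names and a list standing in for the device, and makes the queueing thread wait on a `threading.Event` until the acknowledgement has been sent.

```python
import queue
import threading


class PendingEvent:
    def __init__(self, details):
        self.details = details
        self.acknowledged = threading.Event()


def on_device_event(details, work_queue, log):
    """Device callback: record, enqueue, acknowledge, then release the worker."""
    event = PendingEvent(details)
    work_queue.put(event)
    log.append(f"ack {details}")      # stands in for acknowledging the device
    event.acknowledged.set()


def queue_worker(work_queue, log):
    while True:
        event = work_queue.get()
        if event is None:             # sentinel: shut down
            break
        event.acknowledged.wait()     # never overtake the acknowledgement
        log.append(f"command after {event.details}")


def demo():
    log = []
    work_queue = queue.Queue()
    worker = threading.Thread(target=queue_worker, args=(work_queue, log))
    worker.start()
    for name in ("e1", "e2"):
        on_device_event(name, work_queue, log)
    work_queue.put(None)
    worker.join()
    return log


print(demo())
```

Acknowledging before enqueueing would be an even simpler alternative, if the device protocol allows it. Either way the ordering no longer depends on how many cores the machine has.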
