What lessons did you learn from a project which nearly/actually failed due to bad multithreading?

Question

Sometimes, the framework imposes a certain threading model that makes things an order of magnitude more difficult to get right.

As for me, I have yet to recover from the last failure and I feel that it is better for me not to work on anything that has to do with multithreading in that framework.

I found that I was good at multithreading problems which have simple fork/join, and where data only travels in one direction (while signals can travel in a circular direction).

I am unable to handle a GUI in which some work can only be done on a strictly serialized thread (the "main thread") and other work can only be done on any thread but the main thread (the "worker threads"), and where data and messages have to travel in all directions between N components (a fully connected graph).

At the time when I left that project for another one, there were deadlock issues everywhere. I heard that 2-3 months later, several other developers managed to fix all of the deadlock issues, to the point that it could be shipped to customers. I never found out what piece of knowledge I was missing.

Something about the project: the number of message IDs (integer values that describe the meaning of an event which can be sent into the message queue of another object, regardless of threading) runs into several thousand. Unique strings (user messages) also number about a thousand.

Added

The best analogy I got from another team (unrelated to my past or present projects) was to "put the data in a database", where "database" refers to centralization and atomic updates. In a GUI that is fragmented into multiple views all running on the same "main thread", with all the non-GUI heavy lifting done in individual worker threads, the application's data should be stored in a single place that acts like a database, and that "database" should handle all the "atomic updates" involving non-trivial data dependencies. All other parts of the GUI handle screen drawing and nothing else. The UI parts can cache data, and if this is designed properly the user won't notice that the cache is stale by a fraction of a second. This "database" is also known as "the document" in the Document-View architecture. Unfortunately, no: my app actually stores all of its data in the Views. I don't know why it was built like that.
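To make the "database" idea concrete, here is a minimal sketch in Python. It is not taken from the project described above; the `Document` class, its `update` and `snapshot` methods, and the `add_item` helper are all hypothetical names chosen for illustration. Worker threads change data only through `update`, so dependent fields change together, and a view draws from a copied snapshot that it may cache.

```python
import threading


class Document:
    """Hypothetical central store: the single owner of application data."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {"items": [], "total": 0}
        self._version = 0

    def update(self, mutate):
        """Apply mutate(data) atomically so dependent fields change together.

        mutate must be short, must not block, and must not call back into
        the document (the lock is not reentrant).
        """
        with self._lock:
            mutate(self._data)
            self._version += 1

    def snapshot(self):
        """Return (version, copy); a view may cache the copy and draw from it."""
        with self._lock:
            copy = {"items": list(self._data["items"]),
                    "total": self._data["total"]}
            return self._version, copy


def add_item(value):
    """Build an update that keeps 'items' and 'total' consistent."""
    def mutate(data):
        data["items"].append(value)
        data["total"] += value
    return mutate


def run_demo():
    doc = Document()
    workers = [threading.Thread(target=doc.update, args=(add_item(v),))
               for v in range(1, 101)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return doc.snapshot()
```

A view would compare the version number from `snapshot` with the one it last drew and repaint only when it has changed.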

Fellow contributors:

(Contributors don't need to use real/personal examples. Lessons from anecdotal examples are also welcome, if you judge them to be credible.)

score 13 · Answer 1 · answered May 09 '11 at 08:45

My favorite lesson – very hard won! – is that in a multithreaded program the scheduler is a sneaky swine that hates you. If things can go wrong, they will, but in an unexpected fashion. Get anything wrong, and you'll be chasing weird heisenbugs (because any instrumentation you add will change the timings and give you a different run pattern).

The only sane way to fix this is to strictly corral all the thread handling into as small a piece of code as possible, one that gets it all right and is very conservative about ensuring that locks are properly held (and with a globally constant order of acquisition too). The easiest way to do that is to not share memory (or other resources) between threads except for messaging, which must be asynchronous; that lets you write everything else in a style that is thread-oblivious. (Bonus: scaling out to multiple machines in a cluster is much easier.)
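As a rough illustration of "share nothing except asynchronous messages", here is a small Python sketch. The `worker` function, the `STOP` sentinel, and the message format are assumptions made for the example, not part of any particular framework. The only objects the two threads share are the two queues.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel message telling the worker to exit


def worker(inbox, outbox):
    # Thread-oblivious logic: it sees only messages, never shared state.
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        request_id, text = msg
        outbox.put((request_id, text.upper()))


def run_demo():
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for i, word in enumerate(["a", "b", "c"]):
        inbox.put((i, word))  # asynchronous: put() does not wait for a reply
    inbox.put(STOP)
    t.join()
    # The worker has exited, so nothing else can add to the outbox now.
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results
```

All locking lives inside `queue.Queue`, which is the "small piece of code that gets it right"; the rest of the program never touches a lock.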

score 6 · Answer 2 · answered May 09 '11 at 05:28

Here's a few basic lessons I can think of right now (not from projects failing but from real issues seen on real projects):

Try to avoid any blocking calls while holding a shared resource. A common deadlock pattern: a thread grabs a mutex and makes a callback, and the callback blocks on the same mutex.
Protect access to any shared data structures with a mutex/critical section (or use lock-free ones, but don't invent your own!).
Don't assume atomicity; use atomic APIs (e.g. InterlockedIncrement).
RTFM regarding the thread safety of the libraries, objects or APIs you're using.
Take advantage of the synchronization primitives available, e.g. events and semaphores. (But when using them, pay close attention that you know you're in a good state. I've seen many examples of events signalled in the wrong state, such that events or data can get lost.)
Assume threads can execute concurrently and/or in any order, and that context may switch between threads at any time (unless you are running under an OS that makes other guarantees).
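The callback deadlock in the first point can be sketched in Python as follows. `EventSource` and its methods are hypothetical names for illustration. The idea is to update state and copy the listener list while holding the lock, then release the lock before running any foreign code.

```python
import threading


class EventSource:
    """Hypothetical notifier that never runs callbacks while holding its lock."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like a plain mutex
        self._listeners = []
        self._count = 0

    def subscribe(self, callback):
        with self._lock:
            self._listeners.append(callback)

    def count(self):
        with self._lock:
            return self._count

    def fire(self):
        # Update state and copy the listener list while holding the lock...
        with self._lock:
            self._count += 1  # the read-modify-write is protected by the lock
            current = self._count
            listeners = list(self._listeners)
        # ...then release it before calling code we do not control.
        for callback in listeners:
            callback(current)


def run_demo():
    source = EventSource()
    seen = []
    # This callback re-enters the source. It would block forever if fire()
    # still held the lock while calling it.
    source.subscribe(lambda n: seen.append((n, source.count())))
    source.fire()
    source.fire()
    return seen
```

The trade-off is that a listener may be called just after it has been removed by another thread, so callbacks have to tolerate slightly stale notifications.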

score 6 · Answer 3 · answered May 09 '11 at 08:14

Your entire GUI project should only be called from the main thread. Basically, you shouldn't put a single (.NET) "Invoke" in your GUI. Multithreading should be kept in separate projects that handle the slower data access.

We inherited a part where the GUI project uses a dozen threads. It gives us nothing but problems: deadlocks, race conditions, cross-thread GUI calls...

score 1 · Answer 4 · answered May 09 '11 at 06:09

Java 5 and later has Executors which are intended to make life easier for handling multi-threading fork-join style programs.

Use those; they will remove a lot of the pain.
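The answer is about Java's `java.util.concurrent` executors. The same fork/join shape is sketched here in Python rather than Java, using the standard library's `concurrent.futures.ThreadPoolExecutor` as the closest analogue. The `word_count` and `total_words` functions are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(line):
    """Independent unit of work: no shared state, input in, result out."""
    return len(line.split())


def total_words(lines, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fork: one task per line. Join: map() yields results in input order,
        # and leaving the 'with' block waits for all tasks to finish.
        return sum(pool.map(word_count, lines))
```

The pool owns thread creation, scheduling and shutdown, so the calling code only describes tasks and collects results.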

(and, yes, this I learned from a project :) )

score 1 · Answer 5 · edited May 09 '11 at 21:02

I have a background in hard realtime embedded systems. You can't test for the absence of problems caused by multithreading (though you can sometimes confirm their presence). Code has to be provably correct, so follow best practice around any and all thread interaction.

#1 rule: KISS. If it does not need a thread, don't spin one up. Serialise as much as possible.
#2 rule: Don't break #1.
#3 rule: If you cannot prove through review that it's correct, it's not.

score 1 · Answer 6 · answered May 09 '11 at 21:55

An analogy from a class on multithreading I took last year was very helpful. Thread synchronization is like a traffic signal protecting an intersection (data) from being used by two cars (threads) at once. The mistake a lot of developers make is turning lights red across most of the city to let one car through because they think it's too hard or dangerous to figure out the exact signal they need. That might work well when traffic is light, but will lead to gridlock as your application grows.

That's something I already knew in theory, but after that class the analogy really stuck with me, and I was amazed how often after that I would investigate a threading issue and find one giant queue, or interrupts being disabled everywhere during a write to a variable only two threads used, or mutexes being held for a long time when the code could be refactored to avoid them altogether.

In other words, some of the worst threading issues are caused by overkill trying to avoid threading issues.

score 0 · Answer 7 · answered May 09 '11 at 03:44

Try doing it again.

At least for me, what made the difference was practice. After doing multithreaded and distributed work quite a few times, you just get the hang of it.

I think debugging is really what makes it difficult. I can debug multithreaded code using VS, but I'm at a complete loss if I have to use gdb. My fault, probably.

Another thing worth learning more about is lock-free data structures.

I think this question could be really improved if you specified the framework. .NET thread pools and background workers are really different from QThread, for example. There are always a few platform-specific gotchas.

score 0 · Answer 8 · answered May 09 '11 at 04:31

I've learned that callbacks from lower level modules to higher level modules are a huge evil because they cause acquiring locks in an opposite order.
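One common remedy for the inverted order is to give every lock a fixed rank and always acquire by rank, whichever direction the call comes from. The sketch below is a hedged illustration in Python; `acquire_in_order` and the rank numbers are hypothetical, not something from this answer or from a library.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*ranked_locks):
    """Acquire (rank, lock) pairs by ascending rank; release in reverse."""
    ordered = sorted(ranked_locks, key=lambda pair: pair[0])
    held = []
    try:
        for _, lock in ordered:
            lock.acquire()
            held.append(lock)
        yield
    finally:
        for lock in reversed(held):
            lock.release()


def run_demo(rounds=1000):
    high = (1, threading.Lock())  # higher-level module's lock (assumed rank 1)
    low = (2, threading.Lock())   # lower-level module's lock (assumed rank 2)
    counter = {"value": 0}

    def top_down():
        for _ in range(rounds):
            with acquire_in_order(high, low):
                counter["value"] += 1

    def bottom_up():
        # Names the locks in the opposite order, as a callback path would,
        # but the helper still takes them by rank.
        for _ in range(rounds):
            with acquire_in_order(low, high):
                counter["value"] += 1

    threads = [threading.Thread(target=top_down),
               threading.Thread(target=bottom_up)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

This only works when both locks are known up front. If the lower-level lock is already held when the callback fires, the ordering is already lost, which is why releasing locks before calling upward is usually the safer design.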

Sergej Zagursky
