7

I just spent a long, miserable week debugging a stack overflow in a C++/Qt application. The fundamental problem was that I had a function that accepted a callback, and in certain cases the callback was invoked (from another function) before the original function returned. In this case, the callback happened to cause an identical callback to be registered before the callback itself completed, so each registration nested inside the previous one.

The solution, since I'm using the Qt event loop, was to schedule the callback to be triggered by a single-shot QTimer, rather than calling it directly. (For those who haven't used Qt, a single-shot QTimer with a 0-millisecond timeout is just a way to ensure that a unit of work is triggered by the event loop; it's roughly equivalent to Boost.Asio's io_service::post.) Alternatively, the callback itself could have scheduled the callback-registration function instead of calling it directly.
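As a rough illustration of that fix, here is a minimal sketch in Python rather than Qt. The `EventLoop` class and its `post` and `run` methods are hypothetical stand-ins for the Qt event loop and a 0-millisecond single-shot QTimer; they are not Qt API. The only point is that a posted callback runs after the function that posted it has returned.

```python
from collections import deque


class EventLoop:
    """Hypothetical stand-in for a real event loop (not Qt API)."""

    def __init__(self):
        self._queue = deque()

    def post(self, fn, *args):
        # Queue the call instead of making it right now.
        self._queue.append((fn, args))

    def run(self):
        # Drain the queue; every unit of work starts from this frame.
        while self._queue:
            fn, args = self._queue.popleft()
            fn(*args)


log = []
loop = EventLoop()


def register(cb):
    log.append("register: enter")
    loop.post(cb, "error")  # deferred, not called directly
    log.append("register: return")


register(lambda status: log.append("callback: " + status))
loop.run()
```

After `loop.run()`, `log` shows that `register` returned before the callback ran, so a callback that registers again starts from the loop's frame instead of nesting.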

Is this a known problem, and is there a standard way to deal with it?

There are several possible best-practice guidelines I can think of that might help:

  • All callbacks must either be obviously short-lived (e.g. setting flags and then returning) or schedule a "long" callback to be triggered directly by the program's main event loop.
    • This has the obvious drawback of requiring clients to use a main event loop, and it makes callbacks somewhat more complicated. It also places seemingly arbitrary restrictions on how callbacks can be structured.
  • Conversely: all functions that take a callback must schedule their real "work" via the event loop.
    • This has the same drawback as above, but on the other side of the client relationship.
  • Functions that trigger callbacks must only be called via the event loop.
    • This seems like a simpler rule, but I'm not sure how easy it would be to follow in practice.
  • Callbacks themselves can only be called via the event-loop.
    • This is slightly stronger than the above principle, and possibly easier to enforce. It would be something like a "purely-event-driven" architecture.

All of these have pretty obvious drawbacks, and I can't think of any strategies that would work without some way of postponing callbacks via the application's main loop, so in effect I don't see a safe way to use callbacks in a long-running application without using something like the Qt event loop.
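The last guideline (callbacks are only ever called via the event loop) could be enforced on the library side by wrapping each callback once, at the API boundary. A sketch in Python, assuming a plain queue in place of a real event loop; `pending`, `via_event_loop`, `run_pending`, and this simplified `new_transaction` are made-up names for illustration:

```python
from collections import deque

pending = deque()  # hypothetical event queue


def via_event_loop(cb):
    """Wrap cb so that calling the wrapper only queues the real call."""
    def deferred(*args, **kwargs):
        pending.append(lambda: cb(*args, **kwargs))
    return deferred


def run_pending():
    # Stand-in for one pass of the application's main loop.
    while pending:
        pending.popleft()()


def new_transaction(cb, can_schedule):
    cb = via_event_loop(cb)  # wrap once, where the callback enters the API
    if not can_schedule:
        cb("error")          # looks like a direct call, but only queues
    return "returned"


results = []
status = new_transaction(results.append, can_schedule=False)
seen_before_loop = list(results)  # still empty: the callback has not run
run_pending()
```

The design choice here is that callers and callbacks stay unchanged; only the function that accepts the callback needs to know about the loop.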


As per request, here's some Python-esque pseudocode to explain the particular problem I had:

class TransactionManager:
    def newTransaction(self, cb, ...):
        if (can schedule transaction):
            self.scheduleTransaction(cb, ...)
        else:
            cb(error)  # invoked directly, before newTransaction returns
    def scheduleTransaction(self, cb, ...):
        ...  # set up the transaction itself and register the callback

class TaskManager:
    def newShortTask(self, cb, ...):
        # In the original code, an intermediary `ShortTask` object
        # was created, and the `ShortTask` created a separate callback
        # to do some other work in addition to calling `cb`. That's not
        # pertinent to the issue at hand, so it's not included in the
        # pseudocode.
        myTransactionManager.newTransaction(cb, ...)

class LongTask:
    def start(self):
        myTaskManager.newShortTask(
                lambda ...: self.handleSubtaskFinished(...),
                ...)
    def handleSubtaskFinished(self, ...):
        if (task failed):
            self.start()  # try again
        else:
            ...  # continue with task

The problem was that the success of a "transaction" is partially determined by the state of the hardware, and I was testing what happens when the hardware is (temporarily) unable to perform any of the requested transactions. The order of operations was therefore:

  • LongTask::start() is called.
  • LongTask::start() starts a new "short" task via the TaskManager.
  • The new "short" task schedules a new transaction via the TransactionManager.
  • The short task immediately fails, because the hardware is in a bad state. Before TransactionManager::newTransaction() returns, the callback cb is called.
  • LongTask::handleSubtaskFinished() sees that its subtask has failed, so it immediately tries again, going back to the top of this list without unwinding the stack.

NOTE that in this case the "infinite loop until the hardware fixes itself" behavior is correct; the problem is that the application must ensure it doesn't run out of stack-space while waiting for the hardware failure to be corrected.
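The sequence above can be reproduced in a few lines. This is a sketch in Python, not the original C++/Qt code: the `Hardware` class, its `failures` count, the `defer` flag, the `depth` bookkeeping, and the plain `deque` standing in for the event loop are all assumptions made for illustration, and the `TaskManager` layer is left out. With `defer=False` the failure callback is invoked directly, so each retry nests inside the previous one; with `defer=True` each retry starts from the loop.

```python
from collections import deque

queue = deque()  # hypothetical event queue


class Hardware:
    """Pretends to be unavailable for the first `failures` attempts."""

    def __init__(self, failures):
        self.failures = failures

    def ready(self):
        if self.failures > 0:
            self.failures -= 1
            return False
        return True


class TransactionManager:
    def __init__(self, hardware, defer):
        self.hardware, self.defer = hardware, defer

    def newTransaction(self, cb):
        if self.hardware.ready():
            queue.append(lambda: cb(True))   # normal completion, via the loop
        elif self.defer:
            queue.append(lambda: cb(False))  # the fix: report failure later
        else:
            cb(False)                        # the bug: report failure now


class LongTask:
    def __init__(self, manager):
        self.manager = manager
        self.depth = self.max_depth = 0      # how deeply start() is nested
        self.done = False

    def start(self):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        try:
            self.manager.newTransaction(self.handleSubtaskFinished)
        finally:
            self.depth -= 1

    def handleSubtaskFinished(self, ok):
        if not ok:
            self.start()  # try again
        else:
            self.done = True


def run(failures, defer):
    queue.clear()
    task = LongTask(TransactionManager(Hardware(failures), defer))
    task.start()
    while queue:          # stand-in for the main event loop
        queue.popleft()()
    return task


direct = run(failures=50, defer=False)
deferred = run(failures=50, defer=True)
```

Both runs finish, but `direct.max_depth` grows with the number of failures (51 nested calls to `start()` for 50 failures), while `deferred.max_depth` stays at 1. If the hardware stays down long enough, the direct variant runs out of stack; in CPython that shows up as a `RecursionError`.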

2 Answers

1

I think the problem is the same one as with any unintended infinite recursion: you are not perfectly sure what the callee does when you call it, and if that callee happens to call the caller with unchanged arguments, you are doomed. I really don't think that your case is any different; only the mechanics involved differ.

So, I would apply the same to your situation as to recursion in general: it's not a good idea to forbid it outright, since any such scheme will a) forbid a lot of legit code to avoid undesirable behavior in some special cases, and b) be insufficient to avoid undesirable behavior unless it's as strict as a straitjacket. Instead, infinite recursions should be avoided by thinking about your function calls, and likewise infinite callback recursions should be avoided by thinking about your callback registrations.

0

I'm not 100% sure of the problem you had, but it seems like you had issues with callbacks calling callbacks, or at least callbacks being called out of sequence.

That seems to be occurring because you are combining a callback-based architecture with an event-based one, so you don't have full control over your program flow.

A simple answer would be to say use one or the other, and if you have to mix them, only do it in very specific cases with great care.

gbjbaanb