
I am having difficulty understanding when I should be worried about TOCTOU vulnerabilities and how to avoid them. Yes, we can use database transactions, but there are different transaction isolation levels, and using the safest one would slow down the code.

For example, let's suppose I need to check whether a user is an administrator before allowing them to delete a social media post. The easiest way would be (pseudocode):

    bool isUserAdmin = checkAdminPrivileges(user);  // time of check (TOC)
    if (isUserAdmin) {
        deletePost(post);                           // time of use (TOU)
    }

Needless to say, there is a TOCTOU vulnerability here, which could only be resolved by using a serializable transaction, and that is obviously slow.
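
By a serializable transaction I mean something like the following sketch (PostgreSQL-style syntax; the users and posts tables, their columns, and the :user_id/:post_id bind parameters are all invented for the example):

    -- Hypothetical schema: users(id, is_admin), posts(id, ...).
    -- The privilege check and the delete are evaluated inside one
    -- serializable transaction, so there is no window between the
    -- check and the use in which the privileges can change.
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    DELETE FROM posts
    WHERE posts.id = :post_id
      AND EXISTS (SELECT 1
                  FROM users
                  WHERE users.id = :user_id
                    AND users.is_admin);

    COMMIT;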

So, the best option that came to my mind is asking myself whether exploiting that portion of code would cause real harm. In this case, the user might delete a post milliseconds after another process strips them of administrative rights, which is not really a problem.

This subjective way of programming kind of hurts my logical brain, which is always looking for potential flaws and wants to develop something mathematically or logically proven to be reliable.

What is your insight on this?

2 Answers


So, the best option that came to my mind is asking myself whether exploiting that portion of code would cause real harm. In this case, the user might delete a post milliseconds after another process strips them of administrative rights, which is not really a problem.

This is one of those things that can be easy to get hung up on, but I think if you step back, there's a fairly straightforward way to reason about these kinds of situations. Consider this: instead of the access being removed a few milliseconds before the user deleted the post, what if it was removed a few milliseconds after the post was deleted? Would that be any worse? If not, then that timeframe is likely too short to be overly worried about.

Time of check / time of use (TOCTOU) is not just relevant to security. It's a relevant question in many areas of system design and in everyday life.

Here's a scenario I've run into: emails are being created and sent to users in a batch job that runs for a few hours, and users can update their email address at any moment. Is it OK for the email address to be captured at the beginning of the batch job? Any user's address could change between then and when their email is processed, and there's no particular logic to when exactly someone's email is processed within the batch.

Say two users (A and B) change their email addresses at about the same time, in the middle of the batch. User A's email is processed at the beginning of the batch and goes to their old address. User B's email is processed at the end of the batch, and two things could happen:

1. the current address is retrieved, or
2. the address captured at the beginning of the run is used regardless.

Now the question is: is option 2 OK? If it's a bug that B's email was delivered to the old address, is it a bug that A's was as well? Both users changed their addresses at about the same time, and it seems pretty arbitrary to say one is a bug and the other isn't. The only way to prevent this across all scenarios would be to keep waiting to send the email in case the address changes, and if you follow that logic strictly, it leads to never doing anything.

So, if you are still with me here, you should see that in order to get things done, there has to be a point in time where the decision of what to do is made. That brings us to what I think is your real question: how long a gap between the time of check (TOC) and the time of use (TOU) is acceptable? I don't think there's a simple answer to that question; in many cases, there may be no objective answer. It obviously depends on the situation. You wouldn't want an automatic door to close based on sensor data from a minute ago. Minute-old data in a stock purchase analysis would be more than adequate for a typical retail investor, but nowhere near good enough for a high-frequency trading algorithm.

Probably the most important thing is identifying potential TOCTOU issues. The simplest answer I can provide for deciding what to do about them is to talk to other people, especially stakeholders and the parties who are giving requirements. If you have senior technical people to talk to, they might be able to provide guidance on more esoteric computing issues.

A good way to start working through this is to ask questions like: what could go wrong if this occurs? Imagine an admin's account has been compromised. How would you feel if disabling that account took effect in milliseconds? What about 2 hours? Would you feel comfortable justifying that decision while a known attacker was freely operating in the system during that time? Maybe 5-15 minutes would be OK. But don't feel you need to make these decisions on your own unless the correct answer is obvious or you have no other choice.

On a sidenote: one technique that can help with race conditions caused by TOCTOU in databases is to add a condition to your update that causes it to do nothing if the condition is not met. Let's say you have some tasks you want to assign to users as rows in a table, but there are other instances of this same process in the system that might assign a different user. You can add something like `AND assigned IS NULL` to your update's WHERE clause. That way, if any of those tasks were assigned since your TOC, they will be excluded from the update.
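
As a sketch (PostgreSQL-style SQL; the tasks table, its columns, and the bind parameters are invented for illustration), that guarded update might look like:

    -- Hypothetical schema: tasks(id, assigned).
    -- Several workers may race to claim the same tasks. The
    -- "assigned IS NULL" condition re-checks availability at the
    -- time of use, so any row claimed since our earlier read is
    -- silently skipped instead of being reassigned.
    UPDATE tasks
    SET assigned = :worker_id
    WHERE id = ANY(:candidate_task_ids)
      AND assigned IS NULL;

The affected-row count the statement returns then tells you which portion of the batch this worker actually won.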

JimmyJames

using the safest one would slow down the code.

If you think correct code is slow, you want to see the performance of incorrect code, once you factor in all the business malfunction, detective work, and manual cleanup it can involve!

The issue is really about deciding which things in the computer need to be transactionally consistent and which do not. Over-constraining things does not necessarily lead to more correctness or safety, but simply to different kinds of problems (including additional complexity for users or developers).

And it's also worth remembering that the database engine can't enforce transactional consistency with people's brains, with paper records, with cached displays of data, nor (typically) with any other computer application. Business information systems, which always involve human and paper elements in addition to computer applications, have to be designed carefully to maintain an appropriate amount of consistency, rather than assuming total consistency.

In a typical business computer application, where a certain permission has already been granted to a login, there isn't usually any adverse implication from a small period of run-on: the permission is recorded centrally as withdrawn, but the computer continues to allow operation for a short while under the previous state of permissions.

What's usually more important is that audit trails showing which login commanded an operation (and from what computer terminal, etc.) are strictly consistent with the operations actually executed.
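
For instance (again only a sketch, with invented table and column names), writing the audit row in the same transaction as the operation guarantees the two commit or roll back together, so the trail can never show an operation that didn't happen, nor miss one that did:

    BEGIN;

    DELETE FROM posts WHERE id = :post_id;

    -- Recorded atomically with the delete above.
    INSERT INTO audit_log (login, terminal, action, target_id, occurred_at)
    VALUES (:login, :terminal, 'delete_post', :post_id, now());

    COMMIT;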

The prospect of permissions being changed, and the user then sneaking through one last operation in the final seconds, has about the same risks and implications as if the user had just done the operation a few seconds earlier, when a manager had decided to remove the permissions but hadn't yet actually recorded the decision on the computer.

Even if the application was designed to require strict consistency between an operation and the permission controls, it probably shouldn't require it and should be redesigned to stop requiring it.

This subjective way of programming kind of hurts my logical brain, which is always looking for potential flaws and wants to develop something mathematically or logically proven to be reliable.

The reality is that business systems are not "reliable" in a static way. Controlling complexity is important so that staff can supervise things in an ongoing way, reason about what is going on (and what has gone awry), and intervene when necessary, completing the intervention within a reasonable period of time. A "reliable" system is one that has these properties of oversee-ability and intervention-timeliness.

When you get the sense that things are out of control, it's often a sign that you've allowed things to become too complicated to be amenable to ongoing oversight and intervention, and thus a sign that it will be unreliable.

Steve