31

There seem to be rough instruction-count equivalents for the cost of a branch miss, and virtual functions have a similar tradeoff:

  • instruction vs. data cache miss
  • optimization barrier

If you look at something like:

if (x==1) {
   p->do1();
}
else if (x==2) {
   p->do2();
}
else if (x==3) {
   p->do3();
}
...

You could use a member function array; or, if many functions depend on the same categorization, or a more complex categorization exists, use virtual functions:

p->do()
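For concreteness, here is a rough sketch of the member-function-array option (assuming p points to some class P with members do1..do3, and x is validated to be in range):

void (P::*fns[])() = { &P::do1, &P::do2, &P::do3 };

if (x >= 1 && x <= 3) {
   (p->*fns[x - 1])();
}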

But, in general, how expensive are virtual functions vs. branching? It is hard to test on enough platforms to generalize, so I was wondering if anyone had a rough rule of thumb (it would be lovely if it were as simple as "4 ifs is the breakpoint").

In general virtual functions are clearer, and I would lean towards them. But I have several highly critical sections where I can change code from virtual functions to branches. I would prefer to have thoughts on this before I undertake it. (It's not a trivial change, and it's not easy to test across multiple platforms.)

Glenn Teitelbaum

6 Answers

31

I wanted to jump in here among these already-excellent answers and admit that I've taken the ugly approach of actually working backwards to the anti-pattern of changing polymorphic code into switches or if/else branches with measured gains. But I didn't do this wholesale, only for the most critical paths. It doesn't have to be so black and white.

As a disclaimer, I work in areas like raytracing where correctness is not so difficult to achieve (and is often fuzzy and approximated anyway) while speed is often one of the most competitive qualities sought out. A reduction in render times is often one of the most common user requests, with us constantly scratching our heads and figuring out how to achieve it for the most critical measured paths.

Polymorphic Refactoring of Conditionals

First, it's worth understanding why polymorphism can be preferable, from a maintainability standpoint, to conditional branching (switch or a bunch of if/else statements). The main benefit here is extensibility.

With polymorphic code, we can introduce a new subtype to our codebase, add instances of it to some polymorphic data structure, and have all the existing polymorphic code still work automagically with no further modifications. If you have a bunch of code scattered throughout a large codebase that resembles the form of, "If this type is 'foo', do that", you might find yourself with a horrible burden of updating 50 disparate sections of code in order to introduce a new type of thing, and still end up missing a few.
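As a tiny sketch of the difference (hypothetical names), the existing call sites stay untouched when a new subtype appears:

#include <vector>

struct Foo { virtual ~Foo() {} virtual void frobnicate() = 0; };
struct NewThing : Foo { void frobnicate() override { /* new behavior */ } };

void frobnicate_all(std::vector<Foo*>& foos)
{
    for (Foo* f : foos)
        f->frobnicate();  // unchanged: works for NewThing automatically
}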

The maintainability benefits of polymorphism naturally diminish here if you just have a couple or even one section of your codebase that needs to do such type checks.

Optimization Barrier

I would suggest not looking at this from the standpoint of branching and pipelining so much, and look at it more from the compiler design mindset of optimization barriers. There are ways to improve branch prediction that apply to both cases, like sorting data based on sub-type (if it fits into a sequence).

What differs more between these two strategies is the amount of information the optimizer has in advance. A direct call to a known function provides a lot more information; an indirect call to a function unknown at compile time acts as an optimization barrier.

When the function being called is known, compilers can obliterate the structure and squash it down to smithereens: inlining calls, eliminating potential aliasing overhead, doing a better job at instruction selection and register allocation, possibly even rearranging loops and other forms of branches, and generating hard-coded miniature LUTs when appropriate (GCC 5.3 recently surprised me by compiling a switch statement into a hard-coded LUT of result data rather than a jump table).

Some of those benefits get lost when we start introducing compile-time unknowns into the mix, as with the case of an indirect function call, and that's where conditional branching can most likely offer an edge.
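A trivial illustration of the information gap (hypothetical functions):

int square(int x) { return x * x; }

int direct() { return square(21); }            // the optimizer can inline and fold this to 'return 441;'
int indirect(int (*f)(int)) { return f(21); }  // unknown callee at compile time: the indirect call usually remains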

Memory Optimization

Take an example of a video game which consists of processing a sequence of creatures repeatedly in a tight loop. In such a case, we might have some polymorphic container like this:

vector<Creature*> creatures;

Note: for simplicity I avoided unique_ptr here.

... where Creature is a polymorphic base type. In this case, one of the difficulties with polymorphic containers is that they often want to allocate memory for each subtype separately/individually (ex: using default throwing operator new for each individual creature).

That will often make the first prioritization for optimization (should we need it) memory-based rather than branching. One strategy here is to use a fixed allocator for each sub-type, encouraging a contiguous representation by allocating in large chunks and pooling memory for each sub-type being allocated. With such a strategy, it can definitely help to sort this creatures container by sub-type (as well as by address), since that not only potentially improves branch prediction but also improves locality of reference (allowing multiple creatures of the same subtype to be accessed from a single cache line prior to eviction).
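Continuing the creatures container above, such a sort could look like this sketch (the cheap non-virtual type() accessor is my assumption, not a given):

#include <algorithm>
#include <tuple>

std::sort(creatures.begin(), creatures.end(),
          [](const Creature* a, const Creature* b) {
              // group by sub-type first, then by address within each group
              return std::make_tuple(a->type(), a) < std::make_tuple(b->type(), b);
          });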

Partial Devirtualization of Data Structures and Loops

Let's say you went through all these motions and you still desire more speed. It's worth noting that each step we venture here is degrading maintainability, and we'll already be at a somewhat metal-grinding stage with diminishing performance returns. So there needs to be a pretty significant performance demand if we tread into this territory, where we're willing to sacrifice maintainability even further for smaller and smaller performance gains.

Yet the next step to try (and always with a willingness to back out our changes if it doesn't help at all) might be manual devirtualization.

Version control tip: unless you're far more optimization-savvy than me, it can be worth creating a new branch at this point, with a willingness to toss it away if the optimization efforts miss, which may very well happen. For me it's all trial and error past this point, even with a profiler in hand.

Nevertheless, we don't have to apply this mindset wholesale. Continuing our example, let's say this video game consists mostly of human creatures, by far. In such a case, we can devirtualize only human creatures by hoisting them out and creating a separate data structure just for them.

vector<Human> humans;               // common case
vector<Creature*> other_creatures;  // additional rare-case creatures

This implies that all the areas in our codebase which need to process creatures need a separate special-case loop for human creatures. Yet that eliminates the dynamic dispatch overhead (or perhaps, more appropriately, optimization barrier) for humans which are, by far, the most common creature type. If these areas are large in number and we can afford it, we might do this:

vector<Human> humans;               // common case
vector<Creature*> other_creatures;  // additional rare-case creatures
vector<Creature*> creatures;        // contains humans and other creatures

... if we can afford this, the less critical paths can stay as they are and simply process all creature types abstractly. The critical paths can process humans in one loop and other_creatures in a second loop.

We can extend this strategy as needed and potentially squeeze some gains this way, yet it's worth noting how much we're degrading maintainability in the process. Using function templates here can help to generate the code for both humans and creatures without duplicating the logic manually.
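A minimal sketch of that idea (step, update, and the types are hypothetical names):

#include <vector>

struct Creature { virtual ~Creature() {} virtual void update() = 0; };
struct Human final : Creature { void update() override { /* ... */ } };  // 'final' lets the compiler devirtualize

inline void step(Human& h)    { h.update(); }   // static type known: direct, inlinable call
inline void step(Creature* c) { c->update(); }  // rare cases keep the virtual dispatch

template <typename Container>
void update_all(Container& container)  // one template generates both loops
{
    for (auto& e : container)
        step(e);
}

// update_all(humans);           -> tight, devirtualized loop
// update_all(other_creatures);  -> generic polymorphic loop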

Partial Devirtualization of Classes

Something I did years ago which was really gross, and I'm not even sure it's beneficial anymore (this was in the C++03 era), was partial devirtualization of a class. In that case, we were already storing a class ID with each instance for other purposes (accessed through an accessor in the base class which was non-virtual). There we did something analogous to this (my memory is a little hazy):

switch (obj->type())
{
   case id_common_type:
       // devirtualize the common case with a static downcast
       static_cast<CommonType*>(obj)->non_virtual_do_something();
       break;
   ...
   default:
       // rare sub-types fall back to ordinary virtual dispatch
       obj->virtual_do_something();
       break;
}

... where virtual_do_something was implemented to call non-virtual versions in a subclass. It's gross, I know, doing an explicit static downcast to devirtualize a function call. I have no idea how beneficial this is now as I have not tried this type of thing for years. With an exposure to data-oriented design, I found the above strategy of splitting up data structures and loops in a hot/cold fashion to be far more useful, opening up more doors for optimization strategies (and far less ugly).

Wholesale Devirtualization

I must admit that I've never gotten this far applying an optimization mindset, so I have no idea of the benefits. I have avoided indirect functions in foresight in cases where I knew there was only going to be one central set of conditionals (ex: event processing with only one central place processing events), but never started off with a polymorphic mindset and optimized all the way up to here.

Theoretically, the immediate benefits here might be a potentially smaller way of identifying a type than a virtual pointer (ex: a single byte if you can commit to the idea that there are 256 unique types or less) in addition to completely obliterating these optimization barriers.

It might also help in some cases to write easier-to-maintain code (versus the optimized manual devirtualization examples above): you just use one central switch statement without having to split up your data structures and loops based on subtype. It can also help when there's an order-dependency, where things have to be processed in a precise order (even if that causes us to branch all over the place). This would be for cases where you don't have too many places that need to do the switch.

I would generally not recommend this even with a very performance-critical mindset unless this is reasonably easy to maintain. "Easy to maintain" would tend to hinge on two dominant factors:

  • Not having a real extensibility need (ex: knowing for sure that you have exactly 8 types of things to process, and never any more).
  • Not having many places in your code that need to check these types (ex: one central place).

... yet I recommend the above scenario in most cases and iterating towards more efficient solutions by partial devirtualization as needed. It gives you a lot more breathing room to balance extensibility and maintainability needs with performance.
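For illustration, such a scheme might look like this sketch (hypothetical types; not from any codebase I've optimized this far):

#include <cstdint>

enum class ThingType : std::uint8_t { type_a, type_b, type_c };  // room for 256 types in one byte

struct Thing
{
    ThingType type;  // 1 byte instead of an 8-byte vptr
    // ... hot data ...
};

void process(Thing& t)
{
    switch (t.type)  // the one central place that branches on type
    {
    case ThingType::type_a: /* ... */ break;
    case ThingType::type_b: /* ... */ break;
    case ThingType::type_c: /* ... */ break;
    }
}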

Virtual Functions vs. Function Pointers

To kind of top this off, I noticed here that there was some discussion about virtual functions vs. function pointers. It is true that virtual functions require a little extra work to call, but that doesn't mean they are slower. Counter-intuitively, it may even make them faster.

It's counter-intuitive here because we're used to measuring cost in terms of instructions without paying attention to the dynamics of the memory hierarchy which tend to have a much more significant impact.

If we're comparing a class with 20 virtual functions vs. a struct which stores 20 function pointers, and both are instantiated multiple times, the memory overhead of each class instance is 8 bytes for the virtual pointer on 64-bit machines, while the memory overhead of each struct instance is 160 bytes.
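To make the arithmetic concrete (typical 64-bit layout; exact sizes aren't guaranteed by the standard):

struct FnTable
{
    void (*fns[20])();     // 20 x 8 bytes = 160 bytes in every instance
};

class TwentyVirtuals
{
public:
    virtual ~TwentyVirtuals() {}
    virtual void f01();
    // ... through f20 ...
};                         // per instance: one hidden 8-byte vptr;
                           // the 20 entries live once, in the shared vtable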

The practical cost there can be a whole lot more compulsory and non-compulsory cache misses with the table of function pointers vs. the class using virtual functions (and possibly page faults at a large enough input scale). That cost tends to dwarf the slightly extra work of indexing a virtual table.

I've also dealt with legacy C codebases (older than I am) where taking such structs filled with function pointers, instantiated numerous times, and turning them into classes with virtual functions actually gave significant performance gains (over 100% improvements), simply due to the massive reduction in memory use, the increased cache-friendliness, etc.

On the flip side, when the comparison becomes more apples-to-apples, I've likewise found the opposite translation, from C++ virtual functions to C-style function pointers, to be useful in these types of scenarios:

class Functionoid
{
public:
    virtual ~Functionoid() {}
    virtual void operator()() = 0;
};

... where the class was storing a single measly overridable function (or two if we count the virtual destructor). In those cases, it can definitely help in critical paths to turn that into this:

void (*func_ptr)(void* instance_data);

... ideally behind a type-safe interface to hide the dangerous casts to/from void*.
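For instance, a wrapper along these lines (my own sketch of such an interface, not a prescription) keeps the casts in one place:

class Callback
{
public:
    template <typename T, void (*Fn)(T&)>
    static Callback bind(T& instance)
    {
        // the capture-less lambda decays to a plain void(*)(void*)
        return Callback(&instance, [](void* p) { Fn(*static_cast<T*>(p)); });
    }

    void operator()() const { fn_(data_); }

private:
    Callback(void* d, void (*f)(void*)) : data_(d), fn_(f) {}
    void* data_;
    void (*fn_)(void*);
};

// Usage (hypothetical):
// void integrate(Particle& p);
// Callback cb = Callback::bind<Particle, &integrate>(particle);
// cb();  // 16 bytes of state, no heap allocation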

In those cases where we're tempted to use a class with a single virtual function, it can quickly help to use function pointers instead. A big reason isn't even necessarily the reduced cost of calling a function pointer. It's because we no longer face the temptation to allocate each separate functionoid in scattered regions of the heap if we're aggregating them into a persistent structure. This kind of approach can make it easier to avoid heap allocation and memory fragmentation overhead if, e.g., the instance data is homogeneous and only the behavior varies.

So there's definitely some cases where using function pointers can help, but often I've found it the other way around if we're comparing a bunch of tables of function pointers to a single vtable which only requires one pointer to be stored per class instance. That vtable will often be sitting in one or more L1 cache lines as well in tight loops.

Conclusion

So anyway, that's my little spin on this topic. I recommend venturing in these areas with caution. Trust measurements, not instinct, and given the way these optimizations often degrade maintainability, only go as far as you can afford (and a wise route would be to err on the side of maintainability).

marstato
13

Observations:

  • With many cases, virtual functions are faster because the vtable lookup is an O(1) operation while the else if() ladder is an O(n) operation. However, this is only true if the distribution of cases is flat.

  • For a single if() ... else, the conditional is faster because you save the function call overhead.

  • So, when you have a flat distribution of cases, a break-even point must exist. The only question is where it is located.

  • If you use a switch() instead of else if() ladder or virtual function calls, your compiler may produce even better code: it can do a branch to a location which is looked up from a table, but which is not a function call. That is, you have all the properties of the virtual function call without all the function call overhead.

  • If one is much more frequent than the rest, starting an if() ... else with that case will give you the best performance: You will execute a single conditional branch which is correctly predicted in most of the cases.

  • Your compiler has no knowledge of the expected distribution of cases and will assume a flat distribution.

Your compiler likely has some good heuristics in place as to when to code a switch() as an else if() ladder and when as a table lookup, so I would tend to trust its judgment unless you know that the distribution of cases is biased.

So, my advice is this:

  • If one of the cases dwarfs the rest in terms of frequency, use a sorted else if() ladder (see the sketch after this list).

  • Otherwise use a switch() statement, unless one of the other methods makes your code much more readable. Be sure that you don't buy a negligible performance gain with significantly reduced readability.

  • If you used a switch() and are still not satisfied by the performance, do the comparison, but be prepared to find out that the switch() was already the fastest possibility.
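As an illustration of the first bullet (hypothetical names and frequencies):

// op_move accounts for, say, 90% of all cases in this workload
if (op == op_move) {
    do_move(state);            // one conditional branch, almost always predicted correctly
} else if (op == op_attack) {  // next most frequent case
    do_attack(state);
} else {
    dispatch_rare(state, op);  // cold path for everything else
}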

7

In general, is it worth using virtual functions to avoid branching?

In general, yes. The benefits for maintenance are significant (testing in separation, separation of concerns, improved modularity and extensibility).

But, in general, how expensive are virtual functions vs. branching? It is hard to test on enough platforms to generalize, so I was wondering if anyone had a rough rule of thumb (it would be lovely if it were as simple as "4 ifs is the breakpoint").

Unless you have profiled your code and know the dispatch between branches (the conditions evaluation) takes more time than the computations performed (the code in the branches), optimize the computations performed.

That is, the correct answer to "how expensive are virtual functions vs. branching?" is: measure and find out.

Rule of thumb: unless you have the situation above (branch discrimination more expensive than branch computations), optimize this part of the code for maintenance effort (use virtual functions).

You say that you want this section to run as fast as possible; how fast is that? What's your concrete requirement?

In general virtual functions are clearer, and I would lean towards them. But I have several highly critical sections where I can change code from virtual functions to branches. I would prefer to have thoughts on this before I undertake it. (It's not a trivial change, and it's not easy to test across multiple platforms.)

Use virtual functions then. This will even allow you to optimize per platform if necessary, and still keep client code clean.

utnapistim
7

The other answers already provide good theoretical arguments. I'd like to add the results of an experiment I have performed recently to estimate whether it would be a good idea to implement a virtual machine (VM) using a large switch over the op-code or rather interpret the op-code as an index into an array of function pointers. While this is not exactly the same as a virtual function call, I think it is reasonably close.

I have written a Python script to randomly generate C++14 code for a VM with an instruction set size picked randomly (albeit not uniformly, sampling the low range more densely) between 1 and 10000. The generated VM always had 128 registers and no RAM. The instructions are not meaningful and all have the following form.

inline void
op0004(machine_state& state) noexcept
{
  const auto c = word_t {0xcf2802e8d0baca1dUL};
  const auto r1 = state.registers[58];
  const auto r2 = state.registers[69];
  const auto r3 = ((r1 + c) | r2);
  state.registers[6] = r3;
}

The script also generates dispatch routines using a switch statement…

inline int
dispatch(machine_state& state, const opcode_t opcode) noexcept
{
  switch (opcode)
  {
  case 0x0000: op0000(state); return 0;
  case 0x0001: op0001(state); return 0;
  // ...
  case 0x247a: op247a(state); return 0;
  case 0x247b: op247b(state); return 0;
  default:
    return -1;  // invalid opcode
  }
}

…and an array of function pointers.

inline int
dispatch(machine_state& state, const opcode_t opcode) noexcept
{
  typedef void (* func_type)(machine_state&);
  static const func_type table[VM_NUM_INSTRUCTIONS] = {
    op0000,
    op0001,
    // ...
    op247a,
    op247b,
  };
  if (opcode >= VM_NUM_INSTRUCTIONS)
    return -1;  // invalid opcode
  table[opcode](state);
  return 0;
}

Which dispatch routine was generated was chosen randomly for each generated VM.

For benchmarking, the stream of op-codes was generated by a randomly seeded (std::random_device) Mersenne twister random engine (std::mt19937_64).

The code for each VM was compiled with GCC 5.2.0 using the -DNDEBUG, -O3 and -std=c++14 switches. First, it was compiled using the -fprofile-generate option and profile data collected for simulating 1000 random instructions. The code was then re-compiled with the -fprofile-use option allowing optimizations based on the collected profile data.

The VM was then exercised (in the same process) four times for 50 000 000 cycles and the time for each run measured. The first run was discarded to eliminate cold-cache effects. The PRNG was not re-seeded between the runs so that they did not perform the same sequence of instructions.

Using this setup, 1000 data points for each dispatching routine were collected. The data was collected on a quad core AMD A8-6600K APU with 2048 KiB cache running 64 bit GNU/Linux without a graphical desktop or other programs running. Shown below is a plot of the average CPU time (with standard deviation) per instruction for each VM.

[Plot: average CPU time per instruction (with standard deviation) versus instruction-set size, for the switch-based and function-table dispatch routines]

From this data, I could gain confidence that using a function table is a good idea except maybe for a very small number of op-codes. I do not have an explanation for the outliers of the switch version between 500 and 1000 instructions.

All source code for the benchmark as well as the full experimental data and a high resolution plot can be found on my website.

5gon12eder
4

In addition to cmaster's good answer, which I upvoted, keep in mind that function pointers are generally strictly faster than virtual functions. Virtual function dispatch generally involves first following a pointer from the object to the vtable, indexing appropriately, and then dereferencing a function pointer. So the final step is the same, but there are extra steps initially. In addition, virtual functions always take this as an argument, while function pointers are more flexible.

Another thing to keep in mind: if your critical path involves a loop, it can be helpful to sort the loop's elements by dispatch destination. Obviously the sort is O(n log n), whereas traversing the loop is only O(n), but if you are going to traverse many times this can be worth it. By sorting by dispatch destination, you ensure that the same code is executed repeatedly, keeping it hot in the icache and minimizing cache misses.
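A sketch of such a sort (using the dynamic type as an ordering key; Base is a stand-in for your hierarchy):

#include <algorithm>
#include <typeinfo>
#include <vector>

void sort_by_dynamic_type(std::vector<Base*>& objects)
{
    std::sort(objects.begin(), objects.end(),
              [](const Base* a, const Base* b) {
                  return typeid(*a).before(typeid(*b));  // any consistent order will do
              });
}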

A third strategy to keep in mind: if you decide to move away from virtual functions/function pointers towards if/switch strategies, you may also be well served by switching from polymorphic objects to something like boost::variant (which also provides the switch case in the form of the visitor abstraction). Polymorphic objects have to be stored by base pointer, so your data is all over the place in cache. This could easily be a bigger influence on your critical path than the cost of virtual lookup. Whereas variant is stored inline as a discriminated union; it has size equal to the largest data type (plus a small constant). If your objects don't differ in size too much, this is a great way to handle them.
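A small sketch of the variant approach (hypothetical shape types):

#include <boost/variant.hpp>
#include <vector>

struct Circle { float radius; };
struct Square { float side; };

typedef boost::variant<Circle, Square> Shape;  // stored inline, sized to the largest alternative

struct AreaVisitor : boost::static_visitor<float>
{
    float operator()(const Circle& c) const { return 3.14159f * c.radius * c.radius; }
    float operator()(const Square& s) const { return s.side * s.side; }
};

float total_area(const std::vector<Shape>& shapes)
{
    AreaVisitor area;
    float sum = 0.0f;
    for (const Shape& shape : shapes)
        sum += boost::apply_visitor(area, shape);  // the visitor plays the role of the switch
    return sum;
}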

Actually, I wouldn't be surprised if improving the cache coherency of your data would have a bigger impact than your original question, so I'd definitely look more into that.

Nir Friedman
4

May I just explain why I think this is an XY-problem? (You're not alone in asking this kind of question.)

I assume your real goal is to save time overall, not just to understand a point about cache-misses and virtual functions.

Here's an example of real performance tuning, in real software.

In real software, things get done that, no matter how experienced the programmer is, could be done better. One doesn't know what they are until the program is written and performance tuning can be done. There are nearly always more than one way to speed up the program. After all, to say a program is optimal, you are saying that in the pantheon of possible programs to solve your problem, none of them take less time. Really?

In the example I linked to, it originally took 2700 microseconds per "job". A series of six problems were fixed, going counter-clockwise around the pizza (the pie chart below). The first speedup removed 33% of the time. The second one removed 11%. But notice, the second one wasn't 11% at the time it was found; it was 16%, because the first problem was already gone (11% / (100% − 33%) ≈ 16%). Similarly, the third problem was magnified from 7.4% to 13% (almost double) because the first two problems were gone.

At the end, this magnification process allowed all but 3.7 microseconds to be eliminated. That's 0.14% of the original time, or a speedup of 730x.

[Pie chart: the six problems, sized by the fraction of the original run time each accounted for]

Removing the initially large problems gives a moderate amount of speedup, but they pave the way for removing later problems. These later problems could have initially been insignificant parts of the total, but after early problems are removed, these small ones become large and can produce big speedups. (It's important to understand that, to get this result, none can be missed, and this post shows how easily they can be.)

[Figure: how each remaining problem's share of the run time is magnified as earlier problems are removed]

Was the final program optimal? Probably not. None of the speedups had anything to do with cache misses. Would cache misses matter now? Maybe.

EDIT: I'm getting downvotes from people homing in on the "highly critical sections" of the OP's question. You don't know something is "highly critical" until you know what fraction of time it accounts for. If the average cost of those methods being called is 10 cycles or more, over time, the method of dispatching to them is probably not "critical", compared to what they are actually doing. I see this over and over, where people treat "needing every nanosecond" as a reason to be penny-wise and pound-foolish.

Mike Dunlavey