Do any compilers do this optimization for virtual calls?

Question

This just came to mind, and I'm not really sure how to search for it.

Let's say you have the following classes:

class A
{
public:
    virtual void Foo() = 0;

    virtual void ManyFoo(int N) 
    {
        for (int i = 0; i < N; ++i) Foo();
    } 
};

class B : public A
{
public:
    virtual void Foo()
    {
        // Do something
    }
};

Do any compilers create a version of ManyFoo() for B that inlines the call to B::Foo()?

If not, does making B a final class enable this optimization?

Edit: I was specifically wondering whether this was done anywhere for virtual calls (so, aside from when the whole call to ManyFoo() is inlined itself).

Alex Celeste · Answer 1 · 2018-09-06T10:36:31.880

I believe the term you're looking for is "devirtualization".

Anyway, did you try it? If we put that example in Compiler Explorer:

extern void extCall ();

class A
{
public:
    virtual void Foo() const = 0;

    virtual void ManyFoo(int N) const
    {
        for (int i = 0; i < N; ++i) Foo();
    } 
};

class B final : public A
{
public:
    virtual void Foo() const
    {
        extCall ();
    }
};

void b_value_foo (B b) {
    b.ManyFoo (6);
}

void b_ref_foo (B const & b) {
    b.ManyFoo (6);
}

void b_indirect_foo (B b) {
    b_ref_foo (b);
}

...GCC is able to produce the following with -Os:

b_value_foo(B):
        push    rax
        call    extCall()
        call    extCall()
        call    extCall()
        call    extCall()
        call    extCall()
        pop     rdx
        jmp     extCall()
b_ref_foo(B const&):
        mov     rax, QWORD PTR [rdi]
        mov     esi, 6
        mov     rax, QWORD PTR [rax+8]
        jmp     rax
b_indirect_foo(B):
        jmp     b_ref_foo(B const&)

It will inline through the virtual call when it's 100% sure of the concrete type of the object b (n.b. if we change -Os to -O2, it will also fully inline b_indirect_foo). But it can't be sure of the concrete type of an object that it only sees through a reference it can't trace back to an instance, and it doesn't seem to trust the final annotation on the class to overrule this (probably because this would be very ABI-fragile; I personally wouldn't want it to). It will trust final annotations on member functions, though, but the structure of your example precludes that: the call to Foo() sits inside A::ManyFoo(), where the static type of this is still A.
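To illustrate the remark about final on member functions, here is a minimal sketch of how the example could be restructured so that the call target is fixed by the language rules. It is an assumption about what "trusting final on member functions" looks like in source form, not the asker's code: B::ManyFoo() and the counter g_calls (a stand-in for extCall()) are hypothetical additions, and whether a given compiler actually inlines the calls still has to be checked in its output.

```cpp
#include <cassert>

// Hypothetical counter standing in for extCall(), so each call is observable.
int g_calls = 0;

class A
{
public:
    virtual ~A() = default;
    virtual void Foo() const = 0;

    virtual void ManyFoo(int N) const
    {
        for (int i = 0; i < N; ++i) Foo();
    }
};

class B : public A
{
public:
    // 'final' on the member function: no class derived from B may override Foo().
    void Foo() const final { ++g_calls; }

    // Hypothetical addition: re-declaring ManyFoo() in B makes the static type
    // of 'this' B const*, so the Foo() call below can only reach B::Foo().
    void ManyFoo(int N) const final
    {
        for (int i = 0; i < N; ++i) Foo();
    }
};

// Seen only through a reference, but both call targets are fixed by 'final',
// so the compiler is allowed (not required) to devirtualize and inline them.
void b_ref_foo(B const& b)
{
    b.ManyFoo(6);
}

void usage_example()
{
    B b;
    g_calls = 0;
    b_ref_foo(b);
    assert(g_calls == 6);  // six calls to B::Foo(), however they were dispatched
}
```

The observable behaviour is the same with or without the optimization; the only way to see the difference is to compare the generated assembly for b_ref_foo.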

GCC has had this optimization for several versions. Clang and MSVC don't seem to do it in this case (but do advertise the feature), so the power clearly varies a lot between examples and compilers.

gnasher729 · Answer 2 · 2018-09-06T16:55:03.510

It's a good guess, but not necessarily true, that when this->ManyFoo() calls the implementation inside A, the implementation of this->Foo() will also be the one inside A. So the compiler could generate pseudo-code for ManyFoo like this:

// Pseudo-code: compare the dynamic target of Foo() with A's implementation once.
if (&this->Foo == &A::Foo) {
    for (int i = 0; i < N; ++i)
        inlined A::Foo();
} else {
    for (int i = 0; i < N; ++i)
        virtual this->Foo();
}

The compiler could also take the address of this->Foo() once, then call that function pointer instead of this->Foo(), if it is faster. The compiler could also just inline the call to Foo() inside ManyFoo(), and wherever Foo() is overridden, create a new version of ManyFoo().
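The last idea, one ManyFoo() per class that overrides Foo(), can be written by hand with the curiously recurring template pattern. The following is only a sketch of that idea: the helper ManyFooFor, the class C, and the int return values are hypothetical and chosen so the result is easy to check.

```cpp
#include <cassert>

class A
{
public:
    virtual ~A() = default;
    virtual int Foo() = 0;
    virtual int ManyFoo(int N) = 0;
};

// Hypothetical helper: stamps out one ManyFoo() per derived class, which is
// what the compiler would otherwise have to generate on its own.
template <class Derived>
class ManyFooFor : public A
{
public:
    int ManyFoo(int N) override
    {
        // The static type is known here, so the qualified call below is a
        // direct, non-virtual call that the compiler may inline.
        Derived& self = static_cast<Derived&>(*this);
        int sum = 0;
        for (int i = 0; i < N; ++i) sum += self.Derived::Foo();
        return sum;
    }
};

class B : public ManyFooFor<B>
{
public:
    int Foo() override { return 3; }
};

class C : public ManyFooFor<C>
{
public:
    int Foo() override { return 5; }
};

// Callers still go through the base interface with one virtual call.
int call_through_base(A& a, int n) { return a.ManyFoo(n); }

void usage_example()
{
    B b;
    C c;
    assert(call_through_base(b, 4) == 12);  // 4 * 3
    assert(call_through_base(c, 2) == 10);  // 2 * 5
}
```

One caveat of doing this by hand: if a further class derived from B overrode Foo(), the qualified call would still reach B::Foo(), so the pattern is only safe when the leaf classes are not overridden again (for example, by marking them final).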

I have seen Java VMs that decided at runtime what to inline, by keeping track for a while of which implementation is usually called, and then inlining it (of course in a safe way, so if a different implementation of Foo was called, it would still work, only slower). So if you end up calling C->Foo() in 99% of cases, then that case would be checked for and inlined. This wouldn't be clever enough to have one inlined version for class C and another for class D.
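The "check for the common case, otherwise fall back" shape can be imitated by hand in C++. The sketch below is an analogy, not what any JVM or C++ compiler emits: the guard uses typeid rather than a vtable comparison, the choice of C as the hot type is hard-coded instead of profiled, and ManyFoo() is made non-virtual to keep the example short.

```cpp
#include <cassert>
#include <typeinfo>

struct A {
    virtual ~A() = default;
    virtual int Foo() = 0;
    int ManyFoo(int N);
};

struct C final : A {
    int Foo() override { return 1; }
};

struct D final : A {
    int Foo() override { return 2; }
};

int A::ManyFoo(int N)
{
    int sum = 0;
    // Guard: test once for the type that is assumed (hypothetically) to be hot.
    if (typeid(*this) == typeid(C)) {
        C& self = static_cast<C&>(*this);
        // Fast path: the qualified call is direct, so it can be inlined.
        for (int i = 0; i < N; ++i) sum += self.C::Foo();
    } else {
        // Slow path: ordinary virtual dispatch, still correct for any type.
        for (int i = 0; i < N; ++i) sum += Foo();
    }
    return sum;
}

void usage_example()
{
    C c;
    D d;
    assert(c.ManyFoo(3) == 3);  // takes the guarded fast path
    assert(d.ManyFoo(3) == 6);  // takes the virtual fallback
}
```

As in the description above, only one type gets the fast path; giving D its own inlined loop would need a second guard.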

score 0 · Answer 3 · answered Sep 06 '18 at 10:13

Yes, kind of. As your methods are defined inline, they can sometimes be inlined. Neither Clang nor GCC creates a specialized B::ManyFoo(int), though.

I've amended the code to prevent unsuitable optimizations, and illustrate some behaviour:

struct A {
    virtual int Foo() = 0;
    virtual int ManyFoo(int N)  {
        int res = 0;
        for (int i = 0; i < N; ++i) res += Foo();
        return res;
    } 
};

struct B : A {
    virtual int Foo() { return 3; }
};

B force_code_generation() { return {}; }

int dynamic_dispatch(A& object) { return object.ManyFoo(2); }
int static_dispatch (B value)   { return value .ManyFoo(2); }

At a non-virtual callsite where the compiler is able to do static dispatch, ManyFoo() and Foo() may be completely flattened away. GCC 8.2 with -O2 is able to evaluate the function at compile time:

mov eax, 6
ret

But Clang doesn't seem to do that optimization. It merely inlines the ManyFoo(2) call, which uses virtual calls to invoke Foo(). Pseudo-code:

static_dispatch(B* rdi):
        push    rbp
        push    rbx
        push    rax

        rbx = rdi

        rax = *rbx  // load vtable
        eax = call rax[0](rdi)  // first Foo() call
        ebp = eax

        rax = *rbx  // load vtable
        rdi = rbx  // move this pointer to rdi
        eax = call rax[0](rdi)
        eax += ebp  // add the Foo() results

        rsp += 8  // discard saved rax
        pop     rbx
        pop     rbp
        return eax

With dynamic dispatch these optimizations are not generally possible. Clang adds no special optimizations and simply uses ordinary virtual calls. However, GCC 8.2 adds guards at the virtual callsites to optionally inline the virtual function. Here's the generated assembly rewritten as pseudo-code and reordered for clarity:

dynamic_dispatch(A& rdi):
        rax = *rdi  // load vtable from object
        rdx = rax[8]  // load ManyFoo(int) vtable entry
        // check if ManyFoo(int) method is A::ManyFoo(int)
        if (rdx != &A::ManyFoo(int)) {
            // fallback for virtual ManyFoo(2) call, and return
            esi = 2
            goto rdx  // tailcall
        }

        // We are now in the specialized A::ManyFoo(2) version.
        // The loop for N=2 is unrolled.
        push    rbp
        push    rbx
        rsp -= 8

        // first Foo() call:
        // check if Foo() is B::Foo(), else fall back to virtual call
        ebx = 3  // result of the first B::Foo() call if it is inlined
        rax = rax[0]  // load Foo() vtable entry
        if (rax != &B::Foo()) {
            // fallback for first virtual Foo() call
            rbp = rdi
            eax = call rax(rdi)
            ebx = eax  // save result of first call

            // second Foo() call:
            // check again if Foo() is B::Foo()
            // Can "this" even change its type???
            rax = *rbp
            rax = rax[0]
            if (rax != &B::Foo()) {
                // fallback for second virtual Foo() call
                rdi = rbp
                eax = call rax(rdi)
                goto end
            }
        }

        eax = 3  // result of second B::Foo() call

end:
        // add the result of the calls and return
        eax += ebx
        rsp += 8
        pop     rbx
        pop     rbp
        return eax

Neither Clang nor GCC changes the generated code depending on whether B is final.

Source: view the assembly on the Godbolt Compiler Explorer.
