codersnotes

Beating The Compiler November 28th, 2016

An oft-repeated fact on programming forums these days is that a decent optimizing compiler will always beat a puny human's attempt at hand-written assembler. There are rare cases, like MPEG decoders, where making good use of the SIMD instructions can allow assembly to massively beat the compiler. But generally you'll hear that the compiler will always do better.

The reason given for this is usually that a modern CPU has so many pipelines and instruction hazards to deal with, and a naive assembly routine won't do as good a job dealing with them.

But is it true? Let's not just take some guys word on the Internet as gospel, let's do a little experiment and find out.

We'll just take a simple piece of code to look at here. I'm not going to pick an example that would benefit heavily from exotic intrinsics. Instead I'll just use an old standard, quicksort.

Here's the naive C++ quicksort we'll be testing against:

struct Item { int key; int value; };

extern "C" 
void sortRoutine(Item *items, int count)
{
    if (count <= 1)
        return; // already sorted if only zero/one element

    // Pick the pivot.
    Item pivot = items[count-1];
    int low = 0;
    for (int pos=0;pos<count-1;pos++)
    {
        if (items[pos].key <= pivot.key) {
            // swap elements
            Item tmp = items[pos];
            items[pos] = items[low];
            items[low] = tmp;

            low++;
        }
    }

    items[count-1] = items[low];
    items[low] = pivot;

    sortRoutine(items, low);
    sortRoutine(items+low+1, count-1-low);
}

It's nothing fancy. It's not the best sorting algorithm in the world, and this isn't even the best implementation of it, but it's simple and it'll do just fine.

Now let's try a direct naive assembly implementation of it:

sortRoutine:
    ; rcx = items
    ; rdx = count
    sub rdx, 1
    jbe done

    ; Pick the pivot.
    mov r8, [rcx+rdx*8] ; r8 = pivot data
    xor r9, r9          ; r9 = low
    xor r10, r10        ; r10 = pos
partition:
    cmp [rcx+r10*8], r8d
    jg noswap

    ; swap elements
    mov rax, [rcx+r10*8]
    mov r11, [rcx+r9*8]
    mov [rcx+r9*8], rax
    mov [rcx+r10*8], r11
    inc r9

noswap:
    inc r10
    cmp r10, rdx
    jb partition

    ; move pivot into place
    mov rax, [rcx+r9*8]
    mov [rcx+rdx*8], rax
    mov [rcx+r9*8], r8

    ; recurse
    sub rdx, r9
    lea rax, [rcx+r9*8+8]
    push rax
    push rdx
    mov rdx, r9
    call sortRoutine
    pop rdx
    pop rcx
    test rdx, rdx
    jnz sortRoutine

done:
    ret

It was quite easy to write this, largely due to Intel's flexible memory addressing operators. What's interesting is that I made no real attempt to pay attention to scheduling, pipelining, etc. I just wrote it out as simple assembly.

Now let's time it. I wrote a simple test harness that sorts a 1000000-item array. I run the test 100 times and take the best-case across the whole set. I'll compile the C++ version with gcc 4.8.1, clang 3.8.0, and MSVC 2013.

Results

sort_cpp_recurse_gcc.exe      : Took 99ms best-case across 100 runs
sort_cpp_recurse_clang.exe    : Took 99ms best-case across 100 runs
sort_cpp_recurse_ms.exe       : Took 98ms best-case across 100 runs
sort_asm_recurse.exe          : Took 92ms best-case across 100 runs

Well that's interesting. The different compilers did mostly around the same, with MSVC having a slight edge. But the assembly version ran fastest, being a good 7% faster in this case.

Now the thing about C++ is that it's not always a good representation of the underlying machine. It's fine when it comes to variables, but it's representation of the stack is very limited. C++ pretends we can only use the stack as a call stack, whereas in reality one of the things we can do is to use it as a data stack instead.

So let's try that and see what happens. We'll remove the recursive calls to sortRoutine, and instead push/pop our data ranges directly from the stack. This allows us to run in a single loop without having to actually recurse. This can often have a strong benefit because it removes the overhead of entering/exiting the function each time.

I won't post the code for it here, but it's in the zipfile below if you want to see it.

sort_cpp_recurse_gcc.exe      : Took 99ms best-case across 100 runs
sort_cpp_recurse_clang.exe    : Took 99ms best-case across 100 runs
sort_cpp_recurse_ms.exe       : Took 98ms best-case across 100 runs
sort_asm_recurse.exe          : Took 92ms best-case across 100 runs
sort_cpp_iter_gcc.exe         : Took 106ms best-case across 100 runs
sort_cpp_iter_clang.exe       : Took 97ms best-case across 100 runs
sort_cpp_iter_ms.exe          : Took 95ms best-case across 100 runs
sort_asm_iter.exe             : Took 92ms best-case across 100 runs

Hmm. The assembly version comes out almost the same. I suspect this is because while an iterative approach removes the function-setup overhead, our hand-written x64 version doesn't actually have much function-setup overhead, so doesn't really benefit.

But for the C++ versions, it's a different story. Most of them got a slight speedup, but gcc is actually slower! As far as I can tell from looking at the disassembly, it seems like it's managed to out-clever itself. The increased control paths and things for it to consider have given it too many balls to keep in the air at once.

I compiled these tests on x64, where the default calling convention is fastcall. I suspect that if compiled for 32-bit instead, using the cdecl stack-based convention, the non-recursive version would have done better relatively. I didn't try it though, I'll leave that as an exercise to the reader.

Conclusion

So it seems like the old "modern compilers are always faster than you" spiel is not necessarily true. What is true however, is that the compiler did a good enough job, and the code is easier to work with. So while you might be able to squeeze a little more speed out, it's probably not worth the maintenance hassle.

The assembly version was faster though. I suppose if there's anything to be learned here, it's that people on the Internet may sometimes be full of shit.

Download

sorttest.zip - All the code used for this article.

Written by Richard Mitton,

software engineer and travelling wizard.

Follow me on twitter: http://twitter.com/grumpygiant