Beating The Compiler November 28th, 2016
An oft-repeated fact on programming forums these days is that a decent optimizing compiler will always beat a puny human's attempt at hand-written assembler. There are rare cases, like MPEG decoders, where making good use of the SIMD instructions can allow assembly to massively beat the compiler. But generally you'll hear that the compiler will always do better.
The reason usually given is that a modern CPU has so many pipelines and instruction hazards to deal with that a naive assembly routine won't schedule around them as well as the compiler can.
But is it true? Let's not just take some guy's word on the Internet as gospel; let's do a little experiment and find out.
We'll just take a simple piece of code to look at here. I'm not going to pick an example that would benefit heavily from exotic intrinsics. Instead I'll just use an old standard, quicksort.
Here's the naive C++ quicksort we'll be testing against:
struct Item { int key; int value; };

extern "C"
void sortRoutine(Item *items, int count)
{
    if (count <= 1)
        return; // already sorted if only zero/one element

    // Pick the pivot.
    Item pivot = items[count-1];

    int low = 0;
    for (int pos = 0; pos < count-1; pos++)
    {
        if (items[pos].key <= pivot.key) {
            // swap elements
            Item tmp = items[pos];
            items[pos] = items[low];
            items[low] = tmp;
            low++;
        }
    }

    items[count-1] = items[low];
    items[low] = pivot;

    sortRoutine(items, low);
    sortRoutine(items+low+1, count-1-low);
}
It's nothing fancy. It's not the best sorting algorithm in the world, and this isn't even the best implementation of it, but it's simple and it'll do just fine.
Now let's try a direct naive assembly implementation of it:
sortRoutine:
    ; rcx = items
    ; rdx = count
    sub rdx, 1
    jbe done                  ; count was 0 or 1, already sorted

    ; Pick the pivot.
    mov r8, [rcx+rdx*8]       ; r8 = pivot data (key and value in one qword)
    xor r9, r9                ; r9 = low
    xor r10, r10              ; r10 = pos

partition:
    cmp [rcx+r10*8], r8d      ; compare items[pos].key with pivot.key
    jg noswap

    ; swap elements
    mov rax, [rcx+r10*8]
    mov r11, [rcx+r9*8]
    mov [rcx+r9*8], rax
    mov [rcx+r10*8], r11
    inc r9

noswap:
    inc r10
    cmp r10, rdx
    jb partition

    ; move pivot into place
    mov rax, [rcx+r9*8]
    mov [rcx+rdx*8], rax
    mov [rcx+r9*8], r8

    ; recurse
    sub rdx, r9               ; rdx = size of the right-hand partition
    lea rax, [rcx+r9*8+8]     ; rax = start of the right-hand partition
    push rax                  ; save the right-hand range while we
    push rdx                  ; sort the left-hand one
    mov rdx, r9
    call sortRoutine
    pop rdx                   ; restore the right-hand range
    pop rcx
    test rdx, rdx
    jnz sortRoutine           ; tail-call to sort the right-hand range

done:
    ret
It was quite easy to write this, largely thanks to Intel's flexible memory addressing modes. What's interesting is that I made no real attempt to pay attention to scheduling, pipelining, etc. I just wrote it out as simple assembly.
Now let's time it. I wrote a simple test harness that sorts a 1,000,000-item array. I run the test 100 times and take the best case across the whole set. I'll compile the C++ version with gcc 4.8.1, clang 3.8.0, and MSVC 2013.
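For reference, the harness is shaped something like this (a simplified sketch; the exact code is in the zipfile below, and details like the key distribution and the timer are incidental):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Item { int key; int value; };
extern "C" void sortRoutine(Item *items, int count);

int main()
{
    const int count = 1000000;

    // Build one random source array, so every run sorts the same data.
    std::vector<Item> source(count);
    for (int i = 0; i < count; i++)
        source[i] = { rand(), i };

    double best = 1e30;
    for (int run = 0; run < 100; run++)
    {
        std::vector<Item> items = source; // fresh unsorted copy each run

        auto t0 = std::chrono::high_resolution_clock::now();
        sortRoutine(items.data(), count);
        auto t1 = std::chrono::high_resolution_clock::now();

        best = std::min(best, std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    printf("Took %.0fms best-case across 100 runs\n", best);
    return 0;
}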
Results
sort_cpp_recurse_gcc.exe : Took 99ms best-case across 100 runs
sort_cpp_recurse_clang.exe : Took 99ms best-case across 100 runs
sort_cpp_recurse_ms.exe : Took 98ms best-case across 100 runs
sort_asm_recurse.exe : Took 92ms best-case across 100 runs
Well that's interesting. The different compilers all performed about the same, with MSVC having a slight edge. But the assembly version ran fastest, a good 7% faster in this case.
Now the thing about C++ is that it's not always a good representation of the underlying machine. It's fine when it comes to variables, but its representation of the stack is very limited. C++ pretends we can only use the stack as a call stack, whereas in reality one of the things we can do is use it as a data stack instead.
So let's try that and see what happens. We'll remove the recursive calls to sortRoutine, and instead push/pop our data ranges directly from the stack. This allows us to run in a single loop without having to actually recurse. This can often have a strong benefit because it removes the overhead of entering/exiting the function each time.
I won't post the full code for it here (it's in the zipfile below if you want to see it), but the rough shape is worth sketching.
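Something like this, in C++ terms (a simplified sketch: the name sortRoutineIter is mine, the real version in the zipfile differs in detail, and descending into the smaller half first is my addition to keep the explicit stack shallow):

// Uses the same Item struct as before.
struct Range { Item *items; int count; };

extern "C"
void sortRoutineIter(Item *items, int count)
{
    Range stack[64]; // explicit data stack instead of the call stack
    int top = 0;
    stack[top++] = { items, count };

    while (top > 0)
    {
        Range r = stack[--top];
        while (r.count > 1)
        {
            // Same partition step as the recursive version.
            Item pivot = r.items[r.count-1];
            int low = 0;
            for (int pos = 0; pos < r.count-1; pos++)
            {
                if (r.items[pos].key <= pivot.key) {
                    Item tmp = r.items[pos];
                    r.items[pos] = r.items[low];
                    r.items[low] = tmp;
                    low++;
                }
            }
            r.items[r.count-1] = r.items[low];
            r.items[low] = pivot;

            // Instead of recursing, push the larger half for later
            // and keep looping on the smaller half. Smaller-first
            // bounds the stack depth to about log2(count).
            Range left  = { r.items, low };
            Range right = { r.items+low+1, r.count-1-low };
            if (left.count > right.count) {
                stack[top++] = left;
                r = right;
            } else {
                stack[top++] = right;
                r = left;
            }
        }
    }
}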
sort_cpp_recurse_gcc.exe : Took 99ms best-case across 100 runs
sort_cpp_recurse_clang.exe : Took 99ms best-case across 100 runs
sort_cpp_recurse_ms.exe : Took 98ms best-case across 100 runs
sort_asm_recurse.exe : Took 92ms best-case across 100 runs
sort_cpp_iter_gcc.exe : Took 106ms best-case across 100 runs
sort_cpp_iter_clang.exe : Took 97ms best-case across 100 runs
sort_cpp_iter_ms.exe : Took 95ms best-case across 100 runs
sort_asm_iter.exe : Took 92ms best-case across 100 runs
Hmm. The assembly version comes out the same. I suspect this is because while an iterative approach removes the function-setup overhead, our hand-written x64 version barely had any function-setup overhead to begin with, so there's little to gain.
But for the C++ versions, it's a different story. The clang and MSVC builds got a slight speedup, but gcc is actually slower! As far as I can tell from the disassembly, it has managed to out-clever itself: the extra control paths in the iterative version give it too many balls to keep in the air at once.
I compiled these tests on x64, where the default calling convention is a register-based fastcall. I suspect that if they were compiled for 32-bit instead, using the stack-based cdecl convention, the non-recursive version would have fared relatively better. I didn't try it, though; I'll leave that as an exercise for the reader.
Conclusion
So it seems the old "modern compilers are always faster than you" spiel is not necessarily true. What is true, however, is that the compiler did a good enough job, and the C++ code is much easier to work with. So while you might be able to squeeze a little more speed out by hand, it's probably not worth the maintenance hassle.
The assembly version was faster though. I suppose if there's anything to be learned here, it's that people on the Internet may sometimes be full of shit.
Download
sorttest.zip - All the code used for this article.
Written by Richard Mitton,
software engineer and travelling wizard.
Follow me on twitter: http://twitter.com/grumpygiant