Assembling a Lower Level understanding

Written by

Time for something different from my usual bloggings about JavaScript and the BBB project. Alongside it, I’m on a team of five working to produce a 3D game using C/C++ and DirectX. Luckily, we’ve been granted a custom framework by our professor but it’s still a lot of tough work. Where as many developers take a “get it working and make it look nice”, it has become habit to me to incorporate one other maxim: “get it working blindly fast”. And it’s being put to the test.

Enter real-time 3D systems, where you have 1/60 of a second to do every calculation required for your application to run. Sure, you can go a little further, most people won’t notice a difference between 30 fps and 60 fps, but with 60 fps being the norm… well it’s tough. And in perf-grinding fashion, I decided I’d take a shot at collision detection where there’s iterations upon iterations of computations.

THE REHEARSAL

First shot went alright. It worked in theory but when implementing it… 1 fpm. No, not 1 frame per second, 1 frame per minute. Turns out I’d underestimated the effect of a nested loop on a data set of size 10,000. So I had to start thinking deeper. And lower. It crossed my mind that to make this work, I may have to cross that hallowed C/Assembler boundary and to do it on my own. So I did some research. Basically all I wanted was an absolute value of some arithmetic to operate inside a nested loop. First shot looked something like this (note comments in assembly start with a semicolon):

register int diff = collidable[j]->getPartition();

__asm {
mov edx,[part1] ;Move part1 to register edx
cmp eax,edx ;Compare the accumulator (eax) which should still hold the value of diff (which is also stored in another register) to edx
jge lblSubtract ;Jump to lblSubtract label if edx is greater than or equal to eax
xchg eax,edx ;Exchange register values
lblSubtract: sub eax,edx ;Subtract edx from eax and store result in eax
mov , eax
}

A few things I learned about assembly (ASM) first: Intel processors are optimized for operation to occur on the eax (accumulator) register, with data for operations coming from the edx (data) register. Other convention registers exist (ecx for count variables, for example) but I wanted to get my feet wet, not drown. For the interested, some have it down to an art.

Also, ASM is very different from C. C does all the memory movement for you, but in ASM you must indicate that you want memory moved to the registers to work with. You can issue instructions to work with it in memory, but these tend to be much more expensive in terms of clock cycles and bytes. Exactly the repercussions of these I’m still learning, I know it’s like saturated fats: keep ’em low. For each desired action too, you must issue the instruction. I’ve commented a few above, but a complete list of Intel instructions (up until 80486) can be found here. I found manuals on the rest of the Intel lineup too.

I kept digging.

THE TESTS

The first approach seems like it’s a pretty good way, huh? Turns out it isn’t. I did some more exploring and learned a few more things. I had three test cases to find the absolute value of a subtraction:

1. Ternary operations

register int diff collidable[j]->getPartition() > part1 ? collidable[j]->getPartition() - part1 : part1 - collidable[j]->getPartition();

2. Negative multiplication

register int diff collidable[j]->getPartition()-part1;
if (diff < 0)
diff *= -1;

3. Inline Assembly (repeated from above)

register int diff = collidable[j]->getPartition();

__asm {
mov edx,[part1] ;Move part1 to register edx
cmp eax,edx ;Compare the accumulator (eax) which should still hold the value of diff (which is also stored in another register) to edx
jge lblSubtract ;Jump to lblSubtract label if edx is greater than or equal to eax
xchg eax,edx ;Exchange register values
lblSubtract: sub eax,edx ;Subtract edx from eax and store result in eax
mov , eax
}

I tested each both with and without compiler optimizations. The results surprised me.

Option 1, the ternary operator, was incredibly slow unoptimized. Checking the disassembly, that one line of C code compiled down to an astonishing 41 instructions. Optimized fared significantly better at 17 instructions. Looking good so far. Not really, actually. The inline assembly approach yielded 17 instructions unoptimized and a blitzing 11 optimized. So with inline assembly I could debug with the performance of an optimized build. Of course I lock it into Intel and AMD processors and leave the PowerPC out but who needs anything but Windows? We’re using DirectX, surely there’s no possibility of porting the framework to OpenGL? Actually, there may be. Oops.

But there was still approach 2. That couldn’t be any good though, it is 3x the number of lines of C than approach 1, and doesn’t have the touch of custom assembly. The results really opened my eyes. Unoptimized, it yields 17 instructions, same as unoptimized inline assembly. Optimized, it gets quashed down to an unbelievable 8 instructions. That’s 5 times fewer than the single line of C tertiary operator and 3 quicker than optimized inline assembly. How was this possible?

THE ANALYSIS

Remember the slave compiler who optimizes everything for us? Turns out he’s not such a bad guy after all, just gets a little confused sometimes. The inline assembly to “improve” the calculation of an absolute value between an intermediate calculation and it’s use just below interfered with the compiler’s optimization heuristic. Comparing disassemblies between optimized approaches 2 (3-line C) and 3 (inline assembly) showed that the three extra operations were from moving the value of diff to and from eax. Without the inline assembly mucking things up, the compiler could see that that was all it needed to work with and just kept it in eax. It even kept it in after calculating the absolute value for use in computations just below. My two hours of careful work and research had actually just gummed up the works. And while 3 or even 30 operations don’t seem like much, it matters inside a block of code which needs to operate 100,000,000 times in 1/60 of a second.

Speaking of those iterations, turns out even those tweaks didn’t do much. It still ground along until I took a look at the bigger picture. Compared to the 2 hours optimizing the inner loop, the 30 minutes I spent changing the algorithm not only made the game playable, but it’s now humming along at 58 fps. So what did I learn?

Work with the compiler, not against it. It makes a great development tool.
If you have performance-critical code and can’t figure out which approach to take, take them all and investigate how they turn out.
Use the new kid on the block: C. The end result will probably be more elegant, quick and portable than if ASM gets involved.
Favour global optimizations over local optimizations EVERY TIME.

Four lessons in two and a half hours, and I feel like I’ve only just scratched the surface.

Comments

2 responses to “Assembling a Lower Level understanding”

2010-11-24

Cathy

just curious is there no way to reduce the complexity of your loop? n^2 is a lot no matter how you optimize the inside of the loop.

Loading…

Reply
2010-11-24

Cathy

ignore my previous comment, read too fast over your last paragraph 😛

Loading…

Reply

Assembling a Lower Level understanding

THE REHEARSAL

THE TESTS

THE ANALYSIS

Share this:

Like this:

Comments

2 responses to “Assembling a Lower Level understanding”

Leave a ReplyCancel reply

More posts

Scaling NUnit Tests Through Sharding

Centralized Exception Reporting in NUnit

Simple test hosts in .NET using NUnitlite and file-based apps.

Streamline .NET Testing with NUnitLite

Discover more from Software by Steven