Jul 8, 2008 at 5:45 AM
Join Date: Jun 18, 2006
Location: Montreal, Canada
Posts: 581
Age: 40
No, this is not an assembly tutorial.
I'm opening this thread to serve as a repository of tips and tricks, questions and answers, and otherwise general assembly knowledge. So folks, dump your tricks into this thread.
---
Translating some common structures into assembly
Pointer
Array
Object
Of course, you can change the registers used around to fit in with your code.
---
The CDQ instruction
If you've read through some of Cave Story's code, you'll notice an instruction, CDQ, being used frequently when working with positions (such as an NPC's position.) This instruction can be very confusing because there's no direct equivalent for it in most (if not all) programming languages. It "sign extends" eax into edx. This can be confusing, particularly when applied to the following context:
The hell's goin' on in ecx now!? If you've managed to make sense of the cdq description and realized it sets edx to -1 if eax is negative and to 0 if it's positive, you're probably even more confused as to what this code does.
CDQ takes what's in eax and simply copies the highest bit all over edx. So if bit 32 of eax is 1, edx will be 111111... and if it's 0, edx will be 00000... If you know your binary, you'll remember that that bit is used to indicate if the number is positive or negative, so CDQ really just copies eax's sign over to edx.
In this case, it means that if eax is a negative number, edx will be set to 0xFFFFFFFF. If you AND that with 0x000001FF, you get 0x000001FF back. If you AND 0x00000000 with 0x000001FF, you still get 0x00000000. It's a clever way of setting edx to some value if eax is negative. We then add it back to eax. Meaning, CDQ is used to simulate some sort of "if eax < 0, then eax = eax + some_value" command over a few lines.
CDQ is called before dividing. If you see an idiv right after a CDQ, the CDQ's just there to keep things sane.
---
Little bits and pieces...
The fastest way to set a register to 0 is to xor it with itself.
xor puts out a 0 when both inputs are 1, so if you xor a number with itself, it will come out as 0. This is a binary operation so it's VERY fast, too, compared to MOVing 0 into the register.
You can quickly multiple/divide powers of 2 using bitshifts (also a binary operation, so it's WAY faster than an imul/idiv) sar (shift arithmetically right) is like a "divide by 2" instruction. If you call "sar eax,3" then eax is divided by 2, then divided by 2 again (4), and divided by 2 a third time (8). In other words, this is like dividing by 8 (2*2*2 = 8). You can multiply by using the opposing instruction, shl (shift left)
Remember your arithmetics classes? Thought all that funky crap with variables would never come in handy? Guess again. While shifting only allows you to multiply/divide by powers of 2, there's a fun little theorem that goes...
You can apply this if you can decompose a number into two powers of two (any more and you'd be better off with a regular imul/idiv in terms of speed.) For instance...
---
Here's an advanced optimization trick you can use. If you don't know what you're doing though, don't worry about it. Chances are, unless you're rewriting part of the engine, you won't have to optimize your code much.
It's called pipelining. Most processors nowadays are capable of this funky little optimization where instead of just waiting, say, 4 cycles while it completes an instruction, it'll start handling the next instruction BEFORE finishing the first one complete. How does the processor do this? Well, an instruction like "MOV" takes a lot more work than you'd expect: the processor has to fetch the instruction, decode it, run it, etc, etc.. And there's no reason why the little bits of circuitry that busy themselves with fetching an instruction can't start fetching the next instruction once the current instruction has gone through that part.
There's a catch: the instructions can't be co-dependant. For instance...
The second instruction has to depend on the first instruction because until we write 3 to eax, we can't write eax to ecx. However...
Pipelining can occure with the second shift. It doesn't depend on the previous instruction, so it can enter the fetch queue the cycle following the first shift being picked up. The total execution time of those two instructions is the time it'd take to execute one shift, plus one measly extra cycle. On the other hand...
...in this case, both shift instructions are dependant on the preceding mov instruction. No speedup occures here (in fact, the extra mov instruction will add even more overhead to the code.)
The bottom line is, if you need speed, see if you can rearrange your code in a way that takes advantage of pipelining. But don't go off wasting hours on tweaking your hacks for speed - only optimize when it becomes necessary. And keep this in mind: tweaking your code will get you a good 10-25% increase in speed. Rewriting your code to use a better algorithm will get you a 25-75% increase in speed.
I'm opening this thread to serve as a repository of tips and tricks, questions and answers, and otherwise general assembly knowledge. So folks, dump your tricks into this thread.
---
Translating some common structures into assembly
Pointer
Code:
C: var = &pointer ; var is a pointer to an integer; pointer is an integer
var = *pointer ; var is an integer
x86: mov [var], pointer ; usually, you'd put in the actual offset of pointer
mov [var], [pointer] ; again, pointer would be an offset
Array
Code:
C: var = array[index]
x86: mov edx, array
mov eax, [array + index*size] ; size = the size of an array item; 4 bytes for integers
mov [var], eax
Object
Code:
C: var = object.member
x86: mov edx, object
mov eax, [object + offset] ; offset points to a member in object
mov [var], eax
Of course, you can change the registers used around to fit in with your code.
---
The CDQ instruction
If you've read through some of Cave Story's code, you'll notice an instruction, CDQ, being used frequently when working with positions (such as an NPC's position.) This instruction can be very confusing because there's no direct equivalent for it in most (if not all) programming languages. It "sign extends" eax into edx. This can be confusing, particularly when applied to the following context:
Code:
mov eax,[SomeYPosition]
sub eax,[SomeYVelocity]
cdq
and edx,000001FF
add eax,edx
mov ecx,eax
The hell's goin' on in ecx now!? If you've managed to make sense of the cdq description and realized it sets edx to -1 if eax is negative and to 0 if it's positive, you're probably even more confused as to what this code does.
CDQ takes what's in eax and simply copies the highest bit all over edx. So if bit 32 of eax is 1, edx will be 111111... and if it's 0, edx will be 00000... If you know your binary, you'll remember that that bit is used to indicate if the number is positive or negative, so CDQ really just copies eax's sign over to edx.
In this case, it means that if eax is a negative number, edx will be set to 0xFFFFFFFF. If you AND that with 0x000001FF, you get 0x000001FF back. If you AND 0x00000000 with 0x000001FF, you still get 0x00000000. It's a clever way of setting edx to some value if eax is negative. We then add it back to eax. Meaning, CDQ is used to simulate some sort of "if eax < 0, then eax = eax + some_value" command over a few lines.
CDQ is called before dividing. If you see an idiv right after a CDQ, the CDQ's just there to keep things sane.
---
Little bits and pieces...
The fastest way to set a register to 0 is to xor it with itself.
Code:
xor eax,eax
You can quickly multiple/divide powers of 2 using bitshifts (also a binary operation, so it's WAY faster than an imul/idiv) sar (shift arithmetically right) is like a "divide by 2" instruction. If you call "sar eax,3" then eax is divided by 2, then divided by 2 again (4), and divided by 2 a third time (8). In other words, this is like dividing by 8 (2*2*2 = 8). You can multiply by using the opposing instruction, shl (shift left)
Remember your arithmetics classes? Thought all that funky crap with variables would never come in handy? Guess again. While shifting only allows you to multiply/divide by powers of 2, there's a fun little theorem that goes...
Code:
a * x + a * y = a * (x+y)
Code:
; We have 3. We want to multiply it by 24.
; 24 is not a power of 2. Darn.
; But, 8 + 16 = 24. Both 8 (2^3) and 16 (2^4) are powers of 2.
; 3 * 24 = 3 * 8 + 3 * 16
mov edx, 3
mov eax, edx
shl edx, 3 ; edx = 3 * 8
shl eax, 4 ; eax = 3 * 16
add eax, edx ; eax = 3 * 16 + 3 * 8 now.
; and eax contains our answer.
---
Here's an advanced optimization trick you can use. If you don't know what you're doing though, don't worry about it. Chances are, unless you're rewriting part of the engine, you won't have to optimize your code much.
It's called pipelining. Most processors nowadays are capable of this funky little optimization where instead of just waiting, say, 4 cycles while it completes an instruction, it'll start handling the next instruction BEFORE finishing the first one complete. How does the processor do this? Well, an instruction like "MOV" takes a lot more work than you'd expect: the processor has to fetch the instruction, decode it, run it, etc, etc.. And there's no reason why the little bits of circuitry that busy themselves with fetching an instruction can't start fetching the next instruction once the current instruction has gone through that part.
There's a catch: the instructions can't be co-dependant. For instance...
Code:
mov eax, 0x03
mov ecx, eax
Code:
mov edx, 3
mov eax, edx
shl edx, 3
shl eax, 4
add eax, edx
Pipelining can occure with the second shift. It doesn't depend on the previous instruction, so it can enter the fetch queue the cycle following the first shift being picked up. The total execution time of those two instructions is the time it'd take to execute one shift, plus one measly extra cycle. On the other hand...
Code:
mov edx, 3
mov ecx, edx
shl ecx, 3
mov eax, edx
shl eax, 4
add eax, ecx
...in this case, both shift instructions are dependant on the preceding mov instruction. No speedup occures here (in fact, the extra mov instruction will add even more overhead to the code.)
The bottom line is, if you need speed, see if you can rearrange your code in a way that takes advantage of pipelining. But don't go off wasting hours on tweaking your hacks for speed - only optimize when it becomes necessary. And keep this in mind: tweaking your code will get you a good 10-25% increase in speed. Rewriting your code to use a better algorithm will get you a 25-75% increase in speed.