Assembly Hacking Tips

RuneLancer · Jul 8, 2008

No, this is not an assembly tutorial.

I'm opening this thread to serve as a repository of tips and tricks, questions and answers, and otherwise general assembly knowledge. So folks, dump your tricks into this thread.

---

Translating some common structures into assembly

Pointer

Code:

C:   var = &pointer ; var is a pointer to an integer; pointer is an integer
     var = *pointer ; var is an integer

x86: mov [var], pointer ; usually, you'd put in the actual offset of pointer
     mov [var], [pointer] ; again, pointer would be an offset

Array

Code:

C:   var = array[index]

x86: mov edx, array
     mov eax, [array + index*size] ; size = the size of an array item; 4 bytes for integers
     mov [var], eax

Object

Code:

C:   var = object.member

x86: mov edx, object
     mov eax, [object + offset] ; offset points to a member in object
     mov [var], eax

Of course, you can change the registers used around to fit in with your code.

---

The CDQ instruction

If you've read through some of Cave Story's code, you'll notice an instruction, CDQ, being used frequently when working with positions (such as an NPC's position.) This instruction can be very confusing because there's no direct equivalent for it in most (if not all) programming languages. It "sign extends" eax into edx. This can be confusing, particularly when applied to the following context:

Code:

mov eax,[SomeYPosition]
sub eax,[SomeYVelocity]
cdq
and edx,000001FF
add eax,edx
mov ecx,eax

The hell's goin' on in ecx now!? If you've managed to make sense of the cdq description and realized it sets edx to -1 if eax is negative and to 0 if it's positive, you're probably even more confused as to what this code does.

CDQ takes what's in eax and simply copies the highest bit all over edx. So if bit 32 of eax is 1, edx will be 111111... and if it's 0, edx will be 00000... If you know your binary, you'll remember that that bit is used to indicate if the number is positive or negative, so CDQ really just copies eax's sign over to edx.

In this case, it means that if eax is a negative number, edx will be set to 0xFFFFFFFF. If you AND that with 0x000001FF, you get 0x000001FF back. If you AND 0x00000000 with 0x000001FF, you still get 0x00000000. It's a clever way of setting edx to some value if eax is negative. We then add it back to eax. Meaning, CDQ is used to simulate some sort of "if eax < 0, then eax = eax + some_value" command over a few lines.

CDQ is called before dividing. If you see an idiv right after a CDQ, the CDQ's just there to keep things sane.

---

Little bits and pieces...

The fastest way to set a register to 0 is to xor it with itself.

Code:

xor eax,eax

xor puts out a 0 when both inputs are 1, so if you xor a number with itself, it will come out as 0. This is a binary operation so it's VERY fast, too, compared to MOVing 0 into the register.

You can quickly multiple/divide powers of 2 using bitshifts (also a binary operation, so it's WAY faster than an imul/idiv) sar (shift arithmetically right) is like a "divide by 2" instruction. If you call "sar eax,3" then eax is divided by 2, then divided by 2 again (4), and divided by 2 a third time (8). In other words, this is like dividing by 8 (2*2*2 = 8). You can multiply by using the opposing instruction, shl (shift left)

Remember your arithmetics classes? Thought all that funky crap with variables would never come in handy? Guess again. While shifting only allows you to multiply/divide by powers of 2, there's a fun little theorem that goes...

Code:

a * x + a * y = a * (x+y)

You can apply this if you can decompose a number into two powers of two (any more and you'd be better off with a regular imul/idiv in terms of speed.) For instance...

Code:

; We have 3. We want to multiply it by 24.
; 24 is not a power of 2. Darn.
; But, 8 + 16 = 24. Both 8 (2^3) and 16 (2^4) are powers of 2.
; 3 * 24 = 3 * 8 + 3 * 16
mov edx, 3
mov eax, edx
shl edx, 3 ; edx = 3 * 8
shl eax, 4 ; eax = 3 * 16
add eax, edx ; eax = 3 * 16 + 3 * 8 now.
; and eax contains our answer.

---

Here's an advanced optimization trick you can use. If you don't know what you're doing though, don't worry about it. Chances are, unless you're rewriting part of the engine, you won't have to optimize your code much.

It's called pipelining. Most processors nowadays are capable of this funky little optimization where instead of just waiting, say, 4 cycles while it completes an instruction, it'll start handling the next instruction BEFORE finishing the first one complete. How does the processor do this? Well, an instruction like "MOV" takes a lot more work than you'd expect: the processor has to fetch the instruction, decode it, run it, etc, etc.. And there's no reason why the little bits of circuitry that busy themselves with fetching an instruction can't start fetching the next instruction once the current instruction has gone through that part.

There's a catch: the instructions can't be co-dependant. For instance...

Code:

mov eax, 0x03
mov ecx, eax

The second instruction has to depend on the first instruction because until we write 3 to eax, we can't write eax to ecx. However...

Code:

mov edx, 3
mov eax, edx
shl edx, 3
shl eax, 4
add eax, edx

Pipelining can occure with the second shift. It doesn't depend on the previous instruction, so it can enter the fetch queue the cycle following the first shift being picked up. The total execution time of those two instructions is the time it'd take to execute one shift, plus one measly extra cycle. On the other hand...

Code:

mov edx, 3
mov ecx, edx
shl ecx, 3
mov eax, edx
shl eax, 4
add eax, ecx

...in this case, both shift instructions are dependant on the preceding mov instruction. No speedup occures here (in fact, the extra mov instruction will add even more overhead to the code.)

The bottom line is, if you need speed, see if you can rearrange your code in a way that takes advantage of pipelining. But don't go off wasting hours on tweaking your hacks for speed - only optimize when it becomes necessary. And keep this in mind: tweaking your code will get you a good 10-25% increase in speed. Rewriting your code to use a better algorithm will get you a 25-75% increase in speed.

Lace · Nov 28, 2010

Hacking Tips

So this thread has been merged, at my request, with an earlier thread. The premise still stands - post tips about things in x86 assembly!

The thing I discovered that prompted me to do this is fairly simple:
LEA IS YOUR BEST FRIEND!

The lea command loads into the first argument the address of the second argument. That is to say, lea ecx, [419c60] is the equivalent of mov ecx, 419c60. This is generally used in CS for the purpose of passing local variables into a larger function (since it gets the absolute address, there isn't any of this [ebp-xx] shite that another function can't access), most often in the case of the graphics display function. These will often take the form of something like: lea edx,[eax*8+ecx]. Note that since pointers can contain basic arithmatic, this single operation is the equivalent of mul-ing eax by 8. adding it to ecx, and moving the entire thing into edx. For comparison:

Code:

lea edx,[eax*8+ecx]
8D 14 C1

versus

Code:

imul eax,eax,8
add eax,ecx
mov edx,eax
6B C0 08 01 C8 89 C2

and in truth it's even more than that, because eax is never changed in the first function, so to create similar function, it would need two more bytes (to push and pop eax).

This part is really what we care about - in a world where we need every byte of space that we can squeeze out of a function, lea can make our code A THIRD the size it would be otherwise. Holy fucking cows.

So that's what I'm here to tell you. I don't know if you knew this already. Nonetheless, use it as you will, cause it's a demn powerful tool.

Post your tips as well.

Noxid · Nov 28, 2010

I have a bunch of tips I could share, I'll have to think of some first (I'll edit this)

1) XOR any register with itself to set it to zero. This is infinitely helpful.

2) Setting low numbers
Requiring most space:
MOV EAX, 12

Next least:
XOR EAX, EAX
ADD EAX, 12

Tiniest:
XOR EAX, EAX
MOV AL, 12

3) Frame calculation
Specifically for entities, this can eliminate the need for framerects - just calculate on the fly!
Saves oodles of space.
More details here: http://www.cavestory.org/forums/threads/2227/

4) use SHL for multiplication in powers of two; it's faster than IMUL
i.e. SHL EAX, 3
is equivalent to
IMUL EAX, EAX, 8

Carrotlord · Nov 28, 2010

Re: Hacking Tips

Dunno if this is obvious or not, but segment register SS works best with stack registers, and DS works best with regular registers.

For pointers:

MOV DWORD PTR SS:[EAX+20],30AB40
is one byte longer than
MOV DWORD PTR DS:[EAX+20],30AB40

while

MOV DWORD PTR SS:[EBP+20],30AB40
is one byte shorter than
MOV DWORD PTR DS:[EBP+20],30AB40

This is cause instructions get prefix bytes appended to their fronts if the default seg reg isn't used.

Yep, can't think of much right now..

Noxid · Nov 28, 2010

Ah, that reminds me of another one. The EAX register is the 'accumulator', and is optimised for doing certain operations like add and imul with large numbers. The gains are miniscule but do exist.

The most effective way to get the absolute value of a register is as follows:

put the value in EAX;

CDQ
XOR EAX, EDX
SUB EAX, EDX