|
Author |
Message |
z5_
Member |
This thread is meant as a collection of tips and tricks. It's not really meant for big discussions or questions, more as a collection of useful things to know when coding in assembler. I'll start by posting some tips from other threads.
|
z5_
Member |
I'll start by pasting this very useful text from Kalms (from the general questions thread):
Generally, the .b / .w / .l operation determines the size of the *operation* but not necessarily the size of the *operands*.
Examples:
An add.w #1,d0 will perform a 16-bit addition, with one 16-bit immediate operand, and one 16-bit register operand (the lower half of d0; upper half will remain untouched).
An addq.w #1,d0 will perform a 16-bit addition, with one 3-bit immediate operand, and one 16-bit register operand.
An add.w (a0),d0 will perform a 16-bit addition, with one 16-bit memory operand, and one 16-bit register operand.
An add.w a0,d0 will perform a 16-bit addition, with two 16-bit register operands (lower half of a0/d0; only lower half of d0 gets updated).
The exception to this rule is operations with an address register as destination operand (add.w #1,a0 for instance). All operations against address registers are performed as 32-bit operations. The source operand must be either 16 or 32 bits. If the source operand is 16 bits, it will be sign-extended to 32 bits before the 32-bit operation takes place. All 32 bits of the address register will be used in the operation (both for reading & writing).
Examples:
An add.w #1,a0 will perform a 32-bit addition, with a 16-bit immediate operand (sign-extended to 32 bits) and a 32-bit register operand. The result is stored in a0.
An addq.w #1,a0 will perform a 32-bit addition, with a 3-bit immediate operand (_zero_extended to 32 bits -- it's stated in the documentation for addq) and a 32-bit register operand. The result is stored in a0.
An add.l #1,a0 will perform a 32-bit addition, with a 32-bit immediate operand and a 32-bit register operand. The result is stored in a0.
An add.w d0,a0 will perform a 32-bit addition, with a 16-bit register operand (lower half of d0) which gets sign-extended to 32 bits, and a 32-bit register operand. The result is stored in a0.
So:
If you want to update/change a pointer, and it is not kept in an address register, you want to perform a full 32-bit operation on the pointer.
If the pointer is in an address register, you can use either the .w or the .l suffix (depending on whether the source operand fits in 16 bits or needs 32 bits), since the operation itself will always be 32-bit.
If it is anything else (an index, a counter, etc etc), pick .b .w .l depending on which size you want the *operation* to be. A counter that is going to count to 10000, increasing by 1 each time, needs to be updated using a 16-bit or 32-bit addition; therefore, you will do "add.w #1 / addq.w #1 / add.l #1 / addq.l #1" to advance it.
If you used "add.b #1" to increase the counter, its value would wrap around at 256.
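A minimal sketch of the rules above, with the register contents one would expect noted in the comments. The starting values are made up for illustration:

```asm
        move.l  #$0001ffff,d0
        addq.w  #1,d0           ; 16-bit add: d0 = $00010000 (upper half untouched)

        movea.l #$0001ffff,a0
        addq.w  #1,a0           ; address register: 32-bit add, a0 = $00020000

        move.l  #$000000ff,d1
        addq.b  #1,d1           ; 8-bit add: d1 = $00000000 (wrapped around)
        move.l  #$000000ff,d1
        addq.w  #1,d1           ; 16-bit add: d1 = $00000100
```

Stepping through this in a debugger should show the same .w suffix behaving differently on d0 and a0.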
|
z5_
Member |
This interesting bit of info came from Blueberry. It's about how dbra handles loop counters (in this case d7). The lesson to be learnt: always handle loop counters as words, not as bytes.
The dbra instruction counts word-sized. After the final iteration of a dbra loop, the counter register contains the value $ffff.
If you only write to the lower byte of d7, then the upper byte of the word will, on the next iteration of the d6 loop, still contain $ff, causing the d7 loop to loop $ff0a (65290) times instead of ten.
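The code being discussed isn't quoted here, so the following is only a sketch of the kind of nested loop described. The labels, the iteration counts and the nop body are assumptions:

```asm
        moveq   #5-1,d6         ; outer loop: 5 iterations
outer:
        move.b  #10-1,d7        ; BUG: writes only the low byte of d7
inner:
        nop                     ; placeholder for the loop body
        dbra    d7,inner        ; leaves d7.w = $ffff when the loop ends
        dbra    d6,outer        ; next pass: d7.w = $ff09 -> $ff0a inner iterations
```

Replacing the move.b with moveq #10-1,d7 (or move.w #10-1,d7) sets the whole word, so every pass runs ten times. Note that even the first pass is only correct here if the upper byte of d7.w happened to be zero on entry.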
|
d0DgE
Member |
therefore, you will do "add.w #1 / addq.w #1 / add.l #1 / addq.l #1" to advance it.
For adding/subtracting immediate values that don't fit addq/subq (which only take 1 to 8), you might also consider using the immediate versions of add/sub: addi.x / subi.x #<val>
The same is available for compares, too: cmpi.x #<val>
And there's a neat trick to avoid word moves on negative values: instead of move.w #$fffe,dx you might as well type moveq #-2,dx
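One thing to keep in mind with the moveq trick: moveq always writes the full 32 bits of the register, while move.w leaves the upper word alone. A small sketch (the starting value is made up):

```asm
        move.l  #$12345678,d0
        move.w  #$fffe,d0       ; d0 = $1234fffe (upper word kept)

        move.l  #$12345678,d0
        moveq   #-2,d0          ; d0 = $fffffffe (sign-extended to 32 bits)
```

So the swap is only safe when nothing useful is kept in the upper word of the register.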
|
z5_
Member |
@d0DgE:
I think I read somewhere that all those ***i instructions (addi, subi, ...) are generated automatically by the assembler?
|
z5_
Member |
Another one for the "must remember at all times" list (from Kalms): register allocation. This is especially useful for optimising inner loops. The thing to remember here: try to stuff as much as possible into data registers before the inner loop. If you run out of data registers, then use address registers (some things need to be kept in mind when using address registers!). Try to avoid using variables in an inner loop altogether.
Register allocation is the process of figuring out "which variables shall I keep in which registers?". The reason why you keep variables in registers is because it can make your code execute faster. Register allocation can be thought of as a sort of "caching".
To make register allocation simple, do the following:
1. Look through a section of your code and try to identify some variables that you access many times. During the setup calculations, you seem to touch x0,x1,y0,y1,dx,dy several times.
2. Assign one register to hold the value of one matching variable. Perhaps: d0 - x0, d1 - y0, etc.
3. At some place in the code, where you have loaded the variables into corresponding registers, make a comment which shows exactly which variables are kept in which registers at that place in the code. Example:
lea screenbase,a0
move.w x0(pc),d0
move.w y0(pc),d1
; d0.w x0
; d1.w y0
; a0 screenbase
; ... use d0, d1 instead of x0, y0 here ...
4. When the registers<->variables mapping is no longer valid (perhaps because you have re-used a few registers for other purposes), place a new comment in your code.
Once you have done such a register allocation, try to avoid loading/storing those registers unnecessarily. Generally, you would need to store out the value in a register only once: just before you are about to re-use the register for caching another variable (or for performing some temporary calculations).
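To make the "store out only once" part concrete, here is a rough sketch built on the example above. x0, y0 and screenbase come from the post; the loop, its count, the a1 scratch register and the data definitions are made up for illustration:

```asm
        lea     screenbase,a0
        move.w  x0(pc),d0
        move.w  y0(pc),d1
        ; d0.w  x0
        ; d1.w  y0
        ; a0    screenbase
        moveq   #16-1,d7        ; hypothetical loop count
loop:
        addq.w  #1,d0           ; work on the registers only,
        addq.w  #2,d1           ; no memory access to x0/y0 inside the loop
        dbra    d7,loop

        lea     x0(pc),a1
        move.w  d0,(a1)         ; store x0 back once, before d0 gets re-used
        lea     y0(pc),a1
        move.w  d1,(a1)         ; same for y0
        ; d0, d1 are free for other purposes from here on
        rts

x0:     dc.w    0
y0:     dc.w    0
screenbase:
        ds.b    256             ; stand-in buffer
```

The stores go through a1 because PC-relative addressing cannot be used as a destination.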
|
z5_
Member |
And this one explains the differences between using address registers and data registers (see the post above about register allocation as to why you should use address registers if you run out of data registers) (by Kalms again):
Operations which use an address register as source and/or destination can only be performed on word- or longword-sized operands (i.e. no "move.b #1,a0").
The MOVE/ADD/SUB etc machine-instructions do not support having address registers as destination.
Instead, the MOVEA/ADDA/SUBA etc instructions have to be used.
Your assembler will do the translation for you, so "move.w d0,a0" will be translated into "movea.w d0,a0" during assembly, without any complaints.
The reason why it is important to know the distinction between xxx and xxxA instructions is:
* xxxA instructions can only work with word- and longword-sized operands:
move.b d0,a0 ; assembly will fail
move.w d0,a0 ; OK
move.l d0,a0 ; OK
* An xxxA.w instruction will sign-extend the 1st operand 16->32bit, and after that the rest of the operation will be performed in 32 bits:
; a0 = $12345678
; d0 = $abcdffff
add.w d0,a0
; a0 = $12345677
move.w d0,a0
; a0 = $ffffffff
move.l d0,a0
; a0 = $abcdffff
* xxxA instructions do not affect flags:
cmp.w #3,d0
sub.l a0,a1
beq.s blah ; jump if d0 == 3
Also, there are no xxxA versions of any logical/bitwise operations (such as Scc, AND/OR/EOR/NOT, shifts/rotates). You can generally just do MOVE,ADD,SUB to the address registers.
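Since the logical operations are missing, anything like masking a pointer has to take a detour through a data register. A minimal sketch, assuming d0 is free as a scratch register:

```asm
        ; a0 = pointer to be rounded down to a multiple of 8
        move.l  a0,d0           ; copy to a data register (this MOVE sets flags)
        and.l   #$fffffff8,d0   ; AND cannot target an address register
        movea.l d0,a0           ; back into a0; MOVEA leaves the flags alone
```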
|
d0DgE
Member |
while we're at additions... can somebody please fill me in what's the deal with that infamous ADDX ?
|
noname
Member |
Addx adds one extra if the carry flag has been set. This can be used to save an instruction in texture-mapping inner loops with fixed-point integers, where the fractional part of "u" is stored in the upper word of "v" and the fractional part of "v" is stored in the upper word of "u". It is a bit fiddly to set up in the first place, but you can ultimately add $80000000 + $80000000 and get the correct result of $00000001. This allows 5-instruction inner loops with precision, which is the standard rotator/texture-mapper code.
|
ZEROblue
Member |
Actually it's the eXtend bit which is added in, which is separate from the carry in the status register. Many instructions, but not all, which set the carry will set the extend to the same value, though.
Probably most useful for doing round-correction and such. Works for left direction bitmap scrolling too :)
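Another common use is multi-precision addition, which also shows why it matters that X is separate from C. A small sketch; the register layout and the move in the middle are assumptions for illustration:

```asm
        ; d1:d0 = d1:d0 + d3:d2  (d1/d3 = high longwords, d0/d2 = low longwords)
        add.l   d2,d0           ; low halves; the carry out goes to both C and X
        move.l  d0,d4           ; MOVE clears C but does not touch X
        addx.l  d3,d1           ; high halves plus X
```

The move in between would destroy the carry, but the extend bit survives until the addx picks it up.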
|
noname
Member |
Zeroblue and my debugger were right, it is the x-flag, hence the name ;)
Here is a piece of code that should demonstrate the use of addx. It uses plain 16:16 fixed-point arithmetic (not the more complicated mixed setup I mentioned in my previous post), here with the fraction in the upper word and the integer part in the lower word, and adds 1.5 + 0.25 + 0.25 to get the correct result of 2. Not very exciting, but it will make addx clear, I hope. Run it through a debugger to see exactly what happens.
; set up
moveq.l #0,d7 ;0 for addx
move.l #$80000001,d0 ;1.5
move.l #$40000000,d1 ;0.25
move.l #$40000000,d2 ;0.25
; d0: 1.5
; d1: 0.25
; d2: 0.25
; d0+d1+d2 must give 2
; 1.5 + 0.25 = 1.75 (#$c0000001)
add.l d0,d1
addx.l d7,d1
; adds nothing as no overflow occurred
; (and d7 is 0)
; 1.75 + 0.25 are expected to be 2
add.l d1,d2
; but you only get 1 and the x-flag set
; addx will be very handy in this case
; as it adds one extra
; if the x-flag has been set
addx.l d7,d2
;correct result of 2 is in d2
rts
|
d0DgE
Member |
thanks a lot, noname. So I see I'm far off using this ATM :)
|
TheDarkCoder
Member |
In some thread, someone (I think Kalms) wrote some nice asm tricks to avoid Bcc instructions in certain situations, such as clamping the value of a byte, or taking the maximum of 2 bytes...
I think those tricks should be reposted here, but I don't remember where they are...
|
ZEROblue
Member |
It's probably in the Coding tutorial: questions-thread. Here are some more trix to do this in unsigned notation:
add d0, d1 ; Clamp D1+D0 to $FF
subx d0, d0
or d0, d1
sub d0, d1 ; Clamp D1-D0 to $00
subx d0, d0
not d0
and d0, d1
sub d0, d1 ; Return maximum in D0
subx d2, d2
not d2
and d2, d1
add d1, d0
sub d0, d1 ; Return minimum in D0
subx d2, d2
and d2, d1
add d1, d0
sub d0, d1 ; Generalization of the above
subx d2, d2
and d1, d2
add d0, d1
sub d2, d1 ; Max in D1
add d2, d0 ; Min in D0
All code is interchangeable between B, W and L size. For B size you can optimize by replacing any subx/not pair with a single scc (set if carry clear) instruction.
Edit: for signed numbers you can just flip the highest bit of your operands by means of eor/add/sub before and after calculating.
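As a rough sketch of both remarks combined, here is a signed byte maximum. It assumes the two signed bytes are in d0 and d1 and that d2 is free as scratch; flipping the top bit is done with eori here:

```asm
        eori.b  #$80,d0         ; flip sign bits: signed order becomes unsigned order
        eori.b  #$80,d1
        sub.b   d0,d1           ; d1 = d1 - d0, carry set if d1 < d0
        scc     d2              ; d2 = $ff if d1 >= d0, else $00 (replaces subx/not)
        and.b   d2,d1           ; keep the difference only if d1 was the larger one
        add.b   d1,d0           ; d0 = the larger of the two
        eori.b  #$80,d0         ; flip back to signed notation
```

With d0 = $fe (-2) and d1 = $03 (3) this should leave $03 in d0.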
|