
noname
Member 
I am just playing around with additive blending in 18 bit RGBA. One of the bottlenecks in the current code is the range check for each pixel's colour values which makes sure that they do not exceed 63 ($3f).
Is there a better solution than doing bytewise ANDs (with $40) in a separate data register, followed by a conditional branch over a move.b #$3f,dn instruction when no overflow occurred? I don't see anything obvious, but then again I didn't see the addx trick for mapping until Touchstone told me about it.

britelite
Member 
Couldn't you calculate the blending with 8 bits using the addx trick, and then just shift the bytes down to 6 bits?

Kalms
Member 
; add pair, each component is in range 0..$3f
add.l d1,d0
; result is in range 0..$7f, make temp copy
move.l d0,d2
; isolate overflow bits
; $00 = no overflow, $40 = overflow
and.l #$40404040,d2
; move overflow bits to bottom of each byte
; $00 = no overflow, $01 = overflow
lsr.l #6,d2
; set stop bits, and flip overflow bits
; $81 = no overflow, $80 = overflow
eor.l #$81818181,d2
; convert overflow bits to bitmasks
; $80 = no overflow, $7f = overflow
sub.l #$01010101,d2
; saturate components which have overflowed
or.l d2,d0
; optional: ensure result is in 0..$3f range
and.l #$3f3f3f3f,d0
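
For anyone following along in C, the sequence above can be sketched roughly as below. This is only a transliteration for checking the arithmetic on a PC, not tuned 68k code; the function name `sat_add6` is made up for illustration, and it assumes every byte of both inputs is already in the range 0..$3f.

```c
#include <stdint.h>

/* Hypothetical helper: saturating add of four packed 6-bit components.
   Assumption: each byte of a and b is in 0..0x3f on entry. */
uint32_t sat_add6(uint32_t a, uint32_t b)
{
    uint32_t sum  = a + b;              /* each byte is now 0..0x7e            */
    uint32_t mask = sum & 0x40404040u;  /* 0x40 where a byte overflowed         */
    mask >>= 6;                         /* 0x01 = overflow, 0x00 = no overflow  */
    mask ^= 0x81818181u;                /* 0x80 = overflow, 0x81 = no overflow  */
    mask -= 0x01010101u;                /* 0x7f = overflow, 0x80 = no overflow  */
    sum  |= mask;                       /* overflowed bytes become 0x7f         */
    return sum & 0x3f3f3f3fu;           /* strip helper bits, leaving 0..0x3f   */
}
```

The stop bit ($80) in every byte is what keeps the subtraction from borrowing across byte boundaries, so all four components are handled independently.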

noname
Member 
Magic, that just gave me a good 20% extra speed. I knew there would be such a solution. Thanks Kalms (and Britelite)!
Next goal is to optimize my caching strategy while drawing. It's all new territory, but hopefully this will gain me even more speed.

sp_
Member 
Kalms' routine is 8 cycles. Here is a 7-cycle version (MC68060):
add.l d1,d0
move.l d0,d2
sub.l #$7f7f7f7f,d2
and.l #$3f3f3f3f,d0
and.l #$1010101,d2
muls.l #$3f,d2
or.l d2,d0

noname
Member 
Oh magic! Hopefully I will find some time for playing with the Amiga again in the not too distant future.

Kalms
Member 
sp, that latest version is
1) broken (try with d0 = $0000007f, d1 = $00000002) and 2) slower when pairing (for each MULS you can do four simple integer arithmetic ops)

sp_
Member 
I wonder how you can get two 6-bit numbers added to become $7f. $3f+$3f = $7e, which is the maximum number...
But you are right. The sub should be replaced by a shift (same speed)
add.l d1,d0
move.l d0,d2
lsr.l #6,d2
and.l #$3f3f3f3f,d0
and.l #$1010101,d2
muls.l #$3f,d2
or.l d2,d0
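
A rough C rendering of the shift-and-multiply variant, for checking the logic only. The name `sat_add6_mul` is hypothetical, the inputs are assumed to hold 0..$3f in every byte, and C's unsigned multiply stands in for `muls.l` (the low 32 bits of the product are the same here).

```c
#include <stdint.h>

/* Hypothetical helper: saturating add via a per-byte overflow flag that is
   widened to 0x3f with a multiply.
   Assumption: each byte of a and b is in 0..0x3f on entry. */
uint32_t sat_add6_mul(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;                     /* each byte is now 0..0x7e     */
    uint32_t ovf = (sum >> 6) & 0x01010101u;  /* 0x01 where a byte overflowed */
    sum &= 0x3f3f3f3fu;                       /* keep the low 6 bits          */
    return sum | (ovf * 0x3fu);               /* 0x01 * 0x3f = 0x3f per byte  */
}
```

Since each flag byte is 0 or 1, the product never carries into the neighbouring byte.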

Kalms
Member 
Eh, right. I mixed up $3f and $7f there.

Azure
Member 
real coders do it in C
ta=*tdes+*tso1++;
tb=ta&0x40404040;
tc=tb>>6;
tb-=tc;
*tdes++=(ta|tb)&0x3f3f3f3f;
(from the MDIV2 source code)
Requires fewer constants, but needs one more temp reg.

sp_
Member 
The C code translated to asm is 8 cycles and uses one more register.
add.l d0,d1
move.l d1,d3
and.l #$40404040,d1
move.l d1,d2
lsr.l #6,d2
sub.l d2,d1
or.l d3,d1
and.l #$3f3f3f3f,d1
My 7-cycle version is still the fastest. :D
OK, by using superscalar execution (pairing two longword pixels in one go), Kalms' method gives 4 cycles per longword and my routine 5...

sp_
Member 
On 030 I would have done the following:
add.l d0,d1
move.l (a0,d1.w),d0
swap d1
or.l (a1,d1.w),d0
The lookup can scramble the bits so that the c2p can run at copy speed.

Blueberry
Member 
You can speed it up even more by only keeping track of whether the 6 bits have overflowed, while making sure never to overflow all 8 bits, but postponing the actual saturation until c2p time.
This can be done using 4 instructions (thus, 2 cycles per longword if you pair them) after the blending add, like this:
move.l d0,d2
and.l #$80808080,d2
lsr.l #1,d2
sub.l d2,d0
To see how this works, consider the overflow bits (the two most significant bits of each byte). Before any 6-bit overflow, the bits are 00; these are not changed by the above instructions. After the first overflow, the bits are 01, also unchanged. At the second overflow, the bits become 10; the 1-bit is extracted, shifted down and subtracted, yielding 01. Thus, the overflow bits stay at 01 after the first overflow, no matter how many overflows occur.
This is exactly the overflow bit pattern expected by all the saturation code pieces given in this thread. So you take any one of them and stick it in your c2p right after loading the chunky data into registers. If you don't have any other postprocessing code in your c2p, the conversion code will take far less time than the writing to chip memory, so the extra saturation code will be essentially free.
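
As a sketch of the idea in C (function names are invented for illustration; the accumulator is assumed to hold at most $7f per byte and each new source at most $3f per byte, so a single add can never carry out of a byte):

```c
#include <stdint.h>

/* Hypothetical helper: fold overflow pattern 10 back to 01 in every byte.
   Patterns 00 and 01 are left untouched. */
uint32_t clamp_overflow_bits(uint32_t v)
{
    uint32_t hi = v & 0x80808080u;  /* bit 7 set: top bits are 10 */
    return v - (hi >> 1);           /* subtract 0x40: 10 becomes 01   */
}

/* Hypothetical helper: one blend step with saturation postponed.
   Assumption: acc bytes <= 0x7f, src bytes <= 0x3f. */
uint32_t blend_deferred(uint32_t acc, uint32_t src)
{
    return clamp_overflow_bits(acc + src);
}
```

After any number of `blend_deferred` calls, bit 6 of a byte records whether that component has ever overflowed, and bit 7 is always clear, so the real saturation can wait until later.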

Kalms
Member 
Neat! (It took me a minute or two to figure out that what Blueberry is suggesting above is to use the 4-instruction sequence for avoiding overflow when adding 3+ layers, and then run the "standard" saturation code afterward [possibly inside the c2p].)
So it would be used something like this when combining 6 layers:
move.l (a0)+,d0
add.l (a1)+,d0
add.l (a2)+,d0
AVOID_OVERFLOW
add.l (a3)+,d0
AVOID_OVERFLOW
add.l (a4)+,d0
AVOID_OVERFLOW
add.l (a5)+,d0
AVOID_OVERFLOW
SATURATE
move.l d0,(a6)+
... and then the SATURATE (and perhaps the last AVOID_OVERFLOW operation as well) can be done during c2p processing.
... someone correct me if I'm wrong please.
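
The same six-layer loop, sketched in C for illustration. It assumes AVOID_OVERFLOW stands for the 4-instruction sequence and SATURATE for the basic saturation code from earlier in the thread; the function names and the array-of-pointers layout are invented here, and every layer byte is assumed to be in 0..$3f.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed expansion of AVOID_OVERFLOW: fold top bits 10 back to 01. */
static uint32_t avoid_overflow(uint32_t v)
{
    return v - ((v & 0x80808080u) >> 1);
}

/* Assumed expansion of SATURATE: bytes with bit 6 set become 0x3f. */
static uint32_t saturate(uint32_t v)
{
    uint32_t mask = (((v & 0x40404040u) >> 6) ^ 0x81818181u) - 0x01010101u;
    return (v | mask) & 0x3f3f3f3fu;
}

/* Hypothetical driver: blend six layers of packed 6-bit pixels into dst. */
void blend_six_layers(const uint32_t *const *layers, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* Three layers fit without a fix-up: 3 * 0x3f = 0xbd per byte. */
        uint32_t v = layers[0][i] + layers[1][i] + layers[2][i];
        v = avoid_overflow(v);
        v = avoid_overflow(v + layers[3][i]);
        v = avoid_overflow(v + layers[4][i]);
        v = avoid_overflow(v + layers[5][i]);
        dst[i] = saturate(v);
    }
}
```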

Blueberry
Member 
I was thinking more of the case where you do an unknown number of blends for each pixel, and where the individual blends are independent, for instance when blending lots of particles. The assumption here is that you don't know whether a blend operation is the first one for a pixel, so you need to always do the overflow avoid sequence.
If, as in your example, you know how many blends come before and after each blend for this particular pixel, then some of the overflow avoid sequences can indeed be left out. You can leave out the first, as you did, and if the saturation code can handle arbitrary overflow bit configurations (i.e. it saturates on 01, 10 and 11), then the two last ones can also be left out.
Such a saturation operation costs just two more instructions. Instead of
move.l d0,d2
and.l #$40404040,d2
we have
move.l d0,d2
lsr.l #1,d2
or.l d0,d2
and.l #$40404040,d2
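
In C, a saturation step that accepts any overflow pattern might look like the sketch below. The name `saturate_any` is hypothetical, and the tail of the function assumes Kalms' mask trick from earlier in the thread is used after the overflow bits have been isolated.

```c
#include <stdint.h>

/* Hypothetical helper: saturate every byte whose top two bits are
   01, 10 or 11 to 0x3f; bytes with top bits 00 pass through unchanged. */
uint32_t saturate_any(uint32_t v)
{
    /* Bit 6 of the result is set if bit 6 or bit 7 of the byte was set.
       The bit shifted in from the neighbouring byte lands in bit 7 and is
       masked away. */
    uint32_t mask = ((v >> 1) | v) & 0x40404040u;
    mask >>= 6;                 /* 0x01 = overflow                     */
    mask ^= 0x81818181u;        /* 0x80 = overflow, 0x81 = no overflow */
    mask -= 0x01010101u;        /* 0x7f = overflow, 0x80 = no overflow */
    return (v | mask) & 0x3f3f3f3fu;
}
```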

Kalms
Member 
Interesting. Thanks!


