A.D.A. Amiga Demoscene Archive

18 bit tricks?

 

noname
Member
#1 - Posted: 17 May 2009 21:05
I am just playing around with additive blending in 18 bit RGBA. One of the bottlenecks in the current code is the range check for each pixel's colour values which makes sure that they do not exceed 63 ($3f).

Is there a better solution than doing byte-wise ANDs (with $40) in a separate data register, followed by a conditional branch over a move.b #$3f,dn instruction when no overflow occurred? I don't see anything obvious, but then again I didn't see the addx trick for mapping until Touchstone told me about it.
britelite
Member
#2 - Posted: 18 May 2009 14:07
couldn't you calculate the blending with 8bits using the addx-trick, and then just shift the bytes to 6bits?
Kalms
Member
#3 - Posted: 18 May 2009 16:48
	; add pair -- each component is in range 0..$3f
	add.l	d1,d0
	; result is in range 0..$7f, make temp copy
	move.l	d0,d2
	; isolate overflow bits
	; $00 = no overflow, $40 = overflow
	and.l	#$40404040,d2
	; move overflow bits to bottom of each byte
	; $00 = no overflow, $01 = overflow
	lsr.l	#6,d2
	; set stop bits, and flip overflow bits
	; $81 = no overflow, $80 = overflow
	eor.l	#$81818181,d2
	; convert overflow bits to bitmasks
	; $80 = no overflow, $7f = overflow
	sub.l	#$01010101,d2
	; saturate components which have overflowed
	or.l	d2,d0
	; optional: ensure result is in 0..$3f range
	and.l	#$3f3f3f3f,d0
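
For readers who want to step through the routine above on a desktop machine, here is a minimal C sketch of the same bit manipulation. The function name `sat_add_mask` is made up for illustration, and the sketch assumes both inputs hold four components in the range 0..$3f, one per byte. Unlike the assembly, it always applies the final "optional" mask.

```c
#include <stdint.h>

/* Saturating add of four packed 6-bit components (one per byte).
   sat_add_mask is a hypothetical name; inputs are assumed to be 0..0x3f per byte. */
uint32_t sat_add_mask(uint32_t a, uint32_t b)
{
    uint32_t sum  = a + b;              /* each byte is 0..0x7e, no carry between bytes */
    uint32_t mask = sum & 0x40404040u;  /* 0x40 where a component overflowed */
    mask >>= 6;                         /* 0x01 where it overflowed */
    mask ^= 0x81818181u;                /* 0x80 = overflow, 0x81 = no overflow */
    mask -= 0x01010101u;                /* 0x7f = overflow, 0x80 = no overflow */
    return (sum | mask) & 0x3f3f3f3fu;  /* saturate, then drop the stop bits */
}
```

The stop bit in bit 7 of every byte is what keeps the subtraction from borrowing across byte boundaries.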
noname
Member
#4 - Posted: 18 May 2009 20:53
Magic - that just gave me a good 20% extra speed. I knew there would be such a solution. Thanks Kalms (and Britelite)!

Next goal is to optimize my caching strategy while drawing. It's all new territory, but hopefully this will gain me even more speed.
sp_
Member
#5 - Posted: 12 Oct 2009 20:26 - Edited
Kalms's routine is 8 cycles. Here is a 7 cycle version (MC68060):

add.l d1,d0
move.l d0,d2
sub.l #$7f7f7f7f,d2
and.l #$3f3f3f3f,d0
and.l #$1010101,d2
muls.l #$3f,d2
or.l d2,d0
noname
Member
#6 - Posted: 12 Oct 2009 21:45
Oh magic! Hopefully I will find some time for playing with the Amiga again in the not too distant future.
Kalms
Member
#7 - Posted: 12 Oct 2009 23:00
sp, that latest version is

1) broken (try with d0 = $0000007f, d1 = $00000002)
and
2) slower when pairing (for each MULS you can do four simple integer arithmetic ops)
sp_
Member
#8 - Posted: 13 Oct 2009 09:04
I wonder how you can get two 6bit numbers added to become $7f. $3f+$3f = $7e which is the maximum number...

But you are right. The sub should be replaced by a shift (same speed)


add.l d1,d0
move.l d0,d2
lsr.l #6,d2
and.l #$3f3f3f3f,d0
and.l #$1010101,d2
muls.l #$3f,d2
or.l d2,d0
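
A minimal C sketch of the shift-and-multiply version above, for checking the arithmetic outside the assembler. The name `sat_add_mul` is hypothetical, and the inputs are assumed to hold four components in 0..$3f, one per byte.

```c
#include <stdint.h>

/* Saturating add using a multiply to expand the overflow bits.
   sat_add_mul is a hypothetical name; inputs are assumed to be 0..0x3f per byte. */
uint32_t sat_add_mul(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;                        /* each byte is 0..0x7e */
    uint32_t ovf = (sum >> 6) & 0x01010101u;     /* 1 in every byte that overflowed */
    return (sum & 0x3f3f3f3fu) | (ovf * 0x3fu);  /* 0x3f in every overflowed byte */
}
```

The product never carries from one byte into the next, because each byte of `ovf` is at most 1 and 1 * $3f still fits in a byte.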
Kalms
Member
#9 - Posted: 13 Oct 2009 14:42
Eh, right. I mixed up $3f and $7f there.
Azure
Member
#10 - Posted: 14 Oct 2009 20:36
real coders do it in C

ta=*tdes+*tso1++;
tb=ta&0x40404040;
tc=tb>>6;
tb-=tc;
*tdes++=(ta|tb)&0x3f3f3f3f;

(from the MDIV2 source code)

Requires fewer constants, but needs one more temp reg.
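
To check the packed version against a plain per-component clamp, something like the following sketch can be used. All function names here (`blend_add_sat`, `blend_add_ref`, `count_mismatches`) are hypothetical and not taken from the MDIV2 source; only the five-line computation itself follows the snippet above.

```c
#include <stddef.h>
#include <stdint.h>

/* Packed saturating add over n longwords, following the snippet above. */
void blend_add_sat(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t ta = dst[i] + src[i];
        uint32_t tb = ta & 0x40404040u;  /* 0x40 where a component overflowed */
        tb -= tb >> 6;                   /* 0x40 - 0x01 = 0x3f in those bytes */
        dst[i] = (ta | tb) & 0x3f3f3f3fu;
    }
}

/* Reference: clamp each 6-bit component's sum to 0x3f, one byte at a time. */
uint32_t blend_add_ref(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = ((a >> shift) & 0x3fu) + ((b >> shift) & 0x3fu);
        out |= (s > 0x3fu ? 0x3fu : s) << shift;
    }
    return out;
}

/* Tries every pair of 6-bit values (replicated into all four bytes)
   and returns how many results differ from the reference. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t x = 0; x <= 0x3fu; x++) {
        for (uint32_t y = 0; y <= 0x3fu; y++) {
            uint32_t a = x * 0x01010101u;
            uint32_t b = y * 0x01010101u;
            uint32_t d = a;
            blend_add_sat(&d, &b, 1);
            if (d != blend_add_ref(a, b))
                bad++;
        }
    }
    return bad;
}
```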
sp_
Member
#11 - Posted: 14 Oct 2009 21:47 - Edited
The C-code translated to asm is 8 cycles and uses one more register.

add.l d0,d1
move.l d1,d3
and.l #$40404040,d1
move.l d1,d2
lsr.l #6,d2
sub.l d2,d1
or.l d3,d1
and.l #$3f3f3f3f,d1

My 7 cycle is still the fastest. :D

OK, with superscalar execution (pairing the instructions for two longword pixels in one go), Kalms's method gives 4 cycles per longword and my routine 5.
sp_
Member
#12 - Posted: 15 Oct 2009 22:03 - Edited
On 030 I would have done the following:

add.l d0,d1
move.l (a0,d1.w),d0
swap d1
or.l (a1,d1.w),d0

The lookup can scramble the bits so that the c2p can run at copyspeed.
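
A rough C sketch of the table idea, under several assumptions that the post does not spell out: each table is indexed by one 16-bit half of the sum, the tables simply hold the saturated pair (no bit scrambling), and normal array indexing is used instead of whatever offset scaling the real tables would need. The names `tab_lo`, `tab_hi`, `init_tables` and `sat_add_lut` are invented for illustration.

```c
#include <stdint.h>

/* Largest possible index is 0x7e7e, so 0x8000 entries are enough. */
static uint32_t tab_lo[0x8000];  /* saturated pair placed in the low word  */
static uint32_t tab_hi[0x8000];  /* saturated pair placed in the high word */

static uint32_t clamp6(uint32_t v)
{
    return v > 0x3fu ? 0x3fu : v;
}

/* Fill both tables once at startup. */
void init_tables(void)
{
    for (uint32_t w = 0; w < 0x8000u; w++) {
        uint32_t pair = (clamp6(w >> 8) << 8) | clamp6(w & 0xffu);
        tab_lo[w] = pair;
        tab_hi[w] = pair << 16;
    }
}

/* Add, then let two lookups do the saturation (inputs assumed 0..0x3f per byte). */
uint32_t sat_add_lut(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return tab_lo[sum & 0xffffu] | tab_hi[sum >> 16];
}
```

Because the tables can hold anything, the saturated values could just as well be stored in a pre-scrambled layout, which is the point made above about the c2p.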
Blueberry
Member
#13 - Posted: 2 Nov 2009 00:26
You can speed it up even more by only keeping track of whether the 6 bits have overflowed, while making sure never to overflow all 8 bits, but postponing the actual saturation until c2p time.

This can be done using 4 instructions (thus, 2 cycles per longword if you pair them) after the blending add, like this:

move.l d0,d2
and.l #$80808080,d2
lsr.l #1,d2
sub.l d2,d0

To see how this works, consider the overflow bits (two most significant bits of each byte): Before any 6-bit overflow, the bits are 00 - these are not changed by the above instructions. After the first overflow, the bits are 01 - also unchanged. At the second overflow, the bits become 10 - the 1-bit is extracted, shifted down and subtracted, yielding 01. Thus, the overflow bits stay at 01 after the first overflow, no matter how many overflows occur.

This is exactly the overflow bit pattern expected by all the saturation code pieces given in this thread. So you take any one of them and stick it in your c2p right after loading the chunky data into registers. If you don't have any other postprocessing code in your c2p, the conversion code will take far less time than the writing to chip memory, so the extra saturation code will be essentially free.
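
The four-instruction sequence can be sketched in C as follows. The names `avoid_overflow` and `blend_deferred` are hypothetical. The sketch assumes the accumulator bytes are at most $7f and each incoming layer byte is at most $3f, so the add itself never carries out of a byte.

```c
#include <stdint.h>

/* Folds overflow state "10" back to "01" in every byte.
   States "00" and "01" pass through unchanged. */
uint32_t avoid_overflow(uint32_t v)
{
    uint32_t top = v & 0x80808080u;  /* bit 7 set where a second overflow happened */
    return v - (top >> 1);           /* subtract 0x40 there: 10xxxxxx becomes 01xxxxxx */
}

/* Adds one more layer without saturating yet; bit 6 remembers the overflow. */
uint32_t blend_deferred(uint32_t acc, uint32_t layer)
{
    return avoid_overflow(acc + layer);
}
```

After an overflow the low six bits of that byte are no longer meaningful. Only bit 6 matters, and the saturation step run later replaces the whole component with $3f.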
Kalms
Member
#14 - Posted: 2 Nov 2009 00:58
Neat!

(It took me a minute or two to figure out that what Blueberry is suggesting above is to use the 4-instruction sequence for avoiding overflow when adding 3+ layers, and then run the "standard" saturation code afterward [possibly inside the c2p].)

So it would be used something like this when combining 6 layers:

move.l (a0)+,d0
add.l (a1)+,d0
add.l (a2)+,d0
AVOID_OVERFLOW
add.l (a3)+,d0
AVOID_OVERFLOW
add.l (a4)+,d0
AVOID_OVERFLOW
add.l (a5)+,d0
AVOID_OVERFLOW
SATURATE
move.l d0,(a6)+


... and then the SATURATE (and perhaps the last AVOID_OVERFLOW operation as well) can be done during c2p processing.

... someone correct me if I'm wrong please.
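
A C sketch of the same six-layer sequence, with AVOID_OVERFLOW and SATURATE written as small helper functions. The helper and function names are made up for illustration; the saturation step reuses the subtract trick from the C snippet earlier in the thread. Every layer is assumed to hold components in 0..$3f.

```c
#include <stdint.h>

/* AVOID_OVERFLOW: fold overflow state "10" back to "01" in every byte. */
static uint32_t avoid_overflow(uint32_t v)
{
    return v - ((v & 0x80808080u) >> 1);
}

/* SATURATE: force every byte with bit 6 set to 0x3f. */
static uint32_t saturate(uint32_t v)
{
    uint32_t m = v & 0x40404040u;
    m -= m >> 6;
    return (v | m) & 0x3f3f3f3fu;
}

/* Combines one longword from each of six layers. */
uint32_t combine6(const uint32_t layer[6])
{
    /* Three 6-bit values sum to at most 0xbd, which still fits in a byte,
       so the first AVOID_OVERFLOW can wait until after the second add. */
    uint32_t acc = layer[0] + layer[1] + layer[2];
    acc = avoid_overflow(acc);
    for (int i = 3; i < 6; i++)
        acc = avoid_overflow(acc + layer[i]);
    return saturate(acc);
}
```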
Blueberry
Member
#15 - Posted: 2 Nov 2009 10:51
I was thinking more of the case where you do an unknown number of blends for each pixel, and where the individual blends are independent, for instance when blending lots of particles. The assumption here is that you don't know whether a blend operation is the first one for a pixel, so you need to always do the overflow avoid sequence.

If, as in your example, you know how many blends come before and after each blend for this particular pixel, then some of the overflow avoid sequences can indeed be left out. You can leave out the first, as you did, and if the saturation code can handle arbitrary overflow bit configurations (i.e. it saturates on 01, 10 and 11), then the two last ones can also be left out.

Such a saturation operation costs just two more instructions. Instead of

move.l d0,d2
and.l #$40404040,d2

we have

move.l d0,d2
lsr.l #1,d2
or.l d0,d2
and.l #$40404040,d2
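
As a C sketch, a saturation step that accepts any of the overflow patterns 01, 10 and 11 might look like this. The name `saturate_any` is hypothetical, and the last two lines borrow the subtract trick from the C snippet earlier in the thread rather than any particular assembly version.

```c
#include <stdint.h>

/* Saturates every byte whose bit 6 or bit 7 is set. */
uint32_t saturate_any(uint32_t v)
{
    /* The shift moves bit 7 of each byte down to bit 6. Bits that cross
       into a neighbouring byte land in bit 7 and are masked away. */
    uint32_t m = ((v >> 1) | v) & 0x40404040u;
    m -= m >> 6;                    /* 0x3f in every overflowed byte */
    return (v | m) & 0x3f3f3f3fu;
}
```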
Kalms
Member
#16 - Posted: 3 Nov 2009 03:21
Interesting. Thanks!

 
