| Author | 
Message | 
 
 
	
	noname 
		Member		 | 
	
	
	 
		I am just playing around with additive blending in 18-bit RGBA. One of the bottlenecks in the current code is the range check for each pixel's colour values, which makes sure that they do not exceed 63 ($3f).
 
 Is there a better solution than doing byte-wise ANDs (with $40) in a separate data register, followed by a conditional branch over a move.b #$3f,dn instruction when no overflow occurred? I don't see anything obvious, but then again I didn't see the addx trick for mapping until Touchstone told me about it.
	 | 
	 
 
	
	britelite 
		Member		 | 
	
	
	 
		Couldn't you calculate the blending with 8 bits using the addx trick, and then just shift the bytes down to 6 bits?
	 | 
	 
 
	
	Kalms 
		Member		 | 
	
	
	
	; add pair -- each component is in range 0..$3f
	add.l	d1,d0
	; result is in range 0..$7f, make temp copy
	move.l	d0,d2
	; isolate overflow bits
	; $00 = no overflow, $40 = overflow
	and.l	#$40404040,d2
	; move overflow bits to bottom of each byte
	; $00 = no overflow, $01 = overflow
	lsr.l	#6,d2
	; set stop bits, and flip overflow bits
	; $81 = no overflow, $80 = overflow
	eor.l	#$81818181,d2
	; convert overflow bits to bitmasks
	; $80 = no overflow, $7f = overflow
	sub.l	#$01010101,d2
	; saturate components which have overflowed
	or.l	d2,d0
	; optional: ensure result is in 0..$3f range
	and.l	#$3f3f3f3f,d0
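
For readers who want to check the routine off the Amiga, here is a minimal C sketch of the same steps, compared against a byte-by-byte clamp. The function names are made up for illustration, and the sketch assumes every input byte is already in the range 0..$3f.

```c
#include <stdint.h>

/* Saturating add of four packed 6-bit components (one per byte),
   following the mask-building steps of the routine above.
   Assumes every byte of a and b is in 0..0x3f. */
uint32_t add_sat6_packed(uint32_t a, uint32_t b)
{
    uint32_t sum  = a + b;                     /* each byte now 0..0x7e */
    uint32_t mask = (sum & 0x40404040u) >> 6;  /* 0x01 where a byte overflowed */
    mask ^= 0x81818181u;                       /* 0x80 = overflow, 0x81 = none */
    mask -= 0x01010101u;                       /* 0x7f = overflow, 0x80 = none */
    return (sum | mask) & 0x3f3f3f3fu;         /* clamp, then drop the stop bits */
}

/* Reference version: clamp one component at a time. */
uint32_t add_sat6_bytewise(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((a >> shift) & 0xffu) + ((b >> shift) & 0xffu);
        if (c > 0x3fu)
            c = 0x3fu;
        out |= c << shift;
    }
    return out;
}

/* Count disagreements over every pair of 6-bit values,
   with each value replicated into all four byte lanes. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t x = 0; x <= 0x3fu; x++) {
        for (uint32_t y = 0; y <= 0x3fu; y++) {
            uint32_t a = x * 0x01010101u;
            uint32_t b = y * 0x01010101u;
            if (add_sat6_packed(a, b) != add_sat6_bytewise(a, b))
                bad++;
        }
    }
    return bad;
}
```

Because the subtraction of $01010101 only ever hits bytes holding $80 or $81, no borrow crosses a byte boundary, which is what lets the four components be handled in one longword.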
  		 
	 | 
	 
 
	
	noname 
		Member		 | 
	
	
	 
		Magic - that just gave me a good 20% extra speed. I knew there would be such a solution. Thanks Kalms (and Britelite)!
 
 Next goal is to optimize my caching strategy while drawing. It's all new territory, but hopefully this will gain me even more speed.
	 | 
	 
 
	
	sp_ 
		Member		 | 
	
	
	 
		Kalms' routine is 8 cycles. Here is a 7-cycle version (MC68060):
	add.l	d1,d0
	move.l	d0,d2
	sub.l	#$7f7f7f7f,d2
	and.l	#$3f3f3f3f,d0
	and.l	#$01010101,d2
	muls.l	#$3f,d2
	or.l	d2,d0
	 | 
	 
 
	
	noname 
		Member		 | 
	
	
	 
		Oh magic! Hopefully I will find some time for playing with the Amiga again in the not too distant future.		 
	 | 
	 
 
	
	Kalms 
		Member		 | 
	
	
	 
		sp, that latest version is
  1) broken (try with d0 = $0000007f, d1 = $00000002) and 2) slower when pairing (for each MULS you can do four simple integer arithmetic ops)		 
	 | 
	 
 
	
	sp_ 
		Member		 | 
	
	
	 
		I wonder how you can get two 6-bit numbers added to become $7f. $3f+$3f = $7e, which is the maximum number...
  But you are right. The sub should be replaced by a shift (same speed):
 
	add.l	d1,d0
	move.l	d0,d2
	lsr.l	#6,d2
	and.l	#$3f3f3f3f,d0
	and.l	#$01010101,d2
	muls.l	#$3f,d2
	or.l	d2,d0
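
As a rough cross-check, the shift-and-multiply idea can be written in C as below. The function names are hypothetical. The second function mirrors the earlier sub-based attempt, included only to show that it misfires even on in-range input (0 + 0 comes out as $3f in the lowest byte).

```c
#include <stdint.h>

/* Shift variant: one flag bit per overflowed byte, expanded to 0x3f by a multiply.
   Assumes every byte of a and b is in 0..0x3f. */
uint32_t add_sat6_mul(uint32_t a, uint32_t b)
{
    uint32_t sum  = a + b;                      /* each byte 0..0x7e */
    uint32_t flag = (sum >> 6) & 0x01010101u;   /* 0x01 where bit 6 was set */
    return (sum & 0x3f3f3f3fu) | (flag * 0x3fu);
}

/* Earlier sub-based variant, kept as written in the thread for comparison. */
uint32_t add_sat6_sub_variant(uint32_t a, uint32_t b)
{
    uint32_t sum  = a + b;
    uint32_t flag = (sum - 0x7f7f7f7fu) & 0x01010101u;  /* not a reliable overflow flag */
    return (sum & 0x3f3f3f3fu) | (flag * 0x3fu);
}
```

The multiply is safe here because the flag word has at most one bit per byte, so flag * $3f never carries from one byte into the next.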
	 | 
	 
 
	
	Kalms 
		Member		 | 
	
	
	 
		Eh, right. I mixed up $3f and $7f there.		 
	 | 
	 
 
	
	Azure 
		Member		 | 
	
	
	 
		real coders do it in C
	ta = *tdes + *tso1++;
	tb = ta & 0x40404040;
	tc = tb >> 6;
	tb -= tc;
	*tdes++ = (ta | tb) & 0x3f3f3f3f;
  (from the MDIV2 source code)
  Requires fewer constants, but needs one more temp reg.		 
	 | 
	 
 
	
	sp_ 
		Member		 | 
	
	
	 
		The C code translated to asm is 8 cycles and uses one more register:
	add.l	d0,d1
	move.l	d1,d3
	and.l	#$40404040,d1
	move.l	d1,d2
	lsr.l	#6,d2
	sub.l	d2,d1
	or.l	d3,d1
	and.l	#$3f3f3f3f,d1
  My 7-cycle version is still the fastest. :D
  OK, when using superscalar execution (pairing two longword pixels in one go), Kalms' method gives 4 cycles per longword and my routine 5.
	 | 
	 
 
	
	sp_ 
		Member		 | 
	
	
	 
		On the 030 I would have done the following:
	add.l	d0,d1
	move.l	(a0,d1.w),d0
	swap	d1
	or.l	(a1,d1.w),d0
  The lookup can scramble the bits so that the c2p can run at copy speed.
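
A minimal C sketch of the two-table idea, assuming each table is indexed by one 16-bit half of the packed sum. The table names, sizes and contents are assumptions for illustration: here the tables hold plain saturated values, whereas the post suggests they could just as well hold a bit layout pre-arranged for the c2p.

```c
#include <stdint.h>

/* Hypothetical tables. Each byte of a 16-bit half-sum is at most 0x7e,
   so the largest index is 0x7e7e and 0x7e7f entries are enough. */
#define HALF_ENTRIES 0x7e7f
static uint32_t lo_table[HALF_ENTRIES];  /* saturated low two components */
static uint32_t hi_table[HALF_ENTRIES];  /* saturated high two components, pre-shifted */

static uint32_t clamp6(uint32_t c)
{
    return c > 0x3fu ? 0x3fu : c;
}

/* Fill both tables once before blending. */
void build_tables(void)
{
    for (uint32_t i = 0; i < HALF_ENTRIES; i++) {
        uint32_t pair = clamp6(i & 0xffu) | (clamp6(i >> 8) << 8);
        lo_table[i] = pair;
        hi_table[i] = pair << 16;
    }
}

/* One add, two lookups, one OR. Assumes every input byte is in 0..0x3f. */
uint32_t add_sat6_lookup(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return lo_table[sum & 0xffffu] | hi_table[sum >> 16];
}
```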
	 | 
	 
 
	
	Blueberry 
		Member		 | 
	
	
	 
		You can speed it up even more by only keeping track of whether the 6 bits have overflowed, while making sure never to overflow all 8 bits, but postponing the actual saturation until c2p time.
  This can be done using 4 instructions (thus, 2 cycles per longword if you pair them) after the blending add, like this:
	move.l	d0,d2
	and.l	#$80808080,d2
	lsr.l	#1,d2
	sub.l	d2,d0
  To see how this works, consider the overflow bits (two most significant bits of each byte): Before any 6-bit overflow, the bits are 00 - these are not changed by the above instructions. After the first overflow, the bits are 01 - also unchanged. At the second overflow, the bits become 10 - the 1-bit is extracted, shifted down and subtracted, yielding 01. Thus, the overflow bits stay at 01 after the first overflow, no matter how many overflows occur.
  This is exactly the overflow bit pattern expected by all the saturation code pieces given in this thread. So you take any one of them and stick it in your c2p right after loading the chunky data into registers. If you don't have any other postprocessing code in your c2p, the conversion code will take far less time than the writing to chip memory, so the extra saturation code will be essentially free.		 
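
The deferred scheme can be sketched in C roughly as follows. The function names are invented for this example, and the sketch assumes every blended source byte is in 0..$3f, so a byte never exceeds $7f + $3f = $be before it is folded.

```c
#include <stdint.h>

/* Fold overflow bits 10 back to 01 in every byte. Assumes bits 11 never
   occur, which holds when a 6-bit value is added to an already folded byte. */
uint32_t avoid_overflow(uint32_t x)
{
    return x - ((x & 0x80808080u) >> 1);
}

/* Deferred saturation on overflow bit 6, using the same steps as the
   C fragment quoted earlier in the thread. */
uint32_t saturate(uint32_t x)
{
    uint32_t m = x & 0x40404040u;  /* 0x40 where a byte has overflowed */
    m -= m >> 6;                   /* 0x40 -> 0x3f */
    return (x | m) & 0x3f3f3f3fu;
}

/* Blend the same source pixel onto dst `count` times, folding after
   every add and saturating only once at the end. */
uint32_t blend_repeated(uint32_t dst, uint32_t src, int count)
{
    for (int i = 0; i < count; i++)
        dst = avoid_overflow(dst + src);
    return saturate(dst);
}
```

Once a byte has overflowed, its low six bits no longer carry a meaningful colour value; only the sticky 01 flag matters, since the final saturation replaces the byte with $3f anyway.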
	 | 
	 
 
	
	Kalms 
		Member		 | 
	
	
	
		Neat! (It took me a minute or two to figure out that what Blueberry is suggesting above is to use the 4-instruction sequence for avoiding overflow when adding 3+ layers, and then run the "standard" saturation code afterward [possibly inside the c2p].)
  So it would be used something like this when combining 6 layers:
	move.l	(a0)+,d0
	add.l	(a1)+,d0
	add.l	(a2)+,d0
	AVOID_OVERFLOW
	add.l	(a3)+,d0
	AVOID_OVERFLOW
	add.l	(a4)+,d0
	AVOID_OVERFLOW
	add.l	(a5)+,d0
	AVOID_OVERFLOW
	SATURATE
	move.l	d0,(a6)+
  ... and then the SATURATE (and perhaps the last AVOID_OVERFLOW operation as well) can be done during c2p processing.
  ... someone correct me if I'm wrong please.
	 | 
	 
 
	
	Blueberry 
		Member		 | 
	
	
	 
		I was thinking more of the case where you do an unknown number of blends for each pixel, and where the individual blends are independent, for instance when blending lots of particles. The assumption here is that you don't know whether a blend operation is the first one for a pixel, so you need to always do the overflow avoid sequence.
  If, as in your example, you know how many blends come before and after each blend for this particular pixel, then some of the overflow avoid sequences can indeed be left out. You can leave out the first, as you did, and if the saturation code can handle arbitrary overflow bit configurations (i.e. it saturates on 01, 10 and 11), then the two last ones can also be left out.
  Such a saturation operation costs just two more instructions. Instead of
	move.l	d0,d2
	and.l	#$40404040,d2
  we have
	move.l	d0,d2
	lsr.l	#1,d2
	or.l	d0,d2
	and.l	#$40404040,d2
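
Put together, the six-layer case with the first and the last two overflow-avoid steps left out might look like this in C. This is only a sketch: the function names are hypothetical and every layer byte is assumed to be in 0..$3f.

```c
#include <stdint.h>

/* Fold overflow bits 10 back to 01 in every byte. */
uint32_t avoid_overflow(uint32_t x)
{
    return x - ((x & 0x80808080u) >> 1);
}

/* Saturate every byte whose overflow bits are 01, 10 or 11. */
uint32_t saturate_any(uint32_t x)
{
    uint32_t m = (x | (x >> 1)) & 0x40404040u;  /* bit 6 = bit 7 OR bit 6 */
    m -= m >> 6;                                 /* 0x40 -> 0x3f */
    return (x | m) & 0x3f3f3f3fu;
}

/* Six layers, with only two fold steps in the middle. */
uint32_t blend6(uint32_t l0, uint32_t l1, uint32_t l2,
                uint32_t l3, uint32_t l4, uint32_t l5)
{
    uint32_t acc = l0 + l1 + l2;      /* at most 0xbd per byte */
    acc = avoid_overflow(acc);        /* at most 0x7f */
    acc = avoid_overflow(acc + l3);   /* at most 0x7f */
    acc += l4;                        /* at most 0xbe */
    acc += l5;                        /* at most 0xfd, still within 8 bits */
    return saturate_any(acc);
}
```

The bit that the shift drags in from the neighbouring byte lands in bit 7, so the AND with $40404040 discards it.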
	 | 
	 
 
	
	Kalms 
		Member		 | 
	
	
	 
		Interesting. Thanks!		 
	 | 
	 
 
  
	
	
	
	
			