A.D.A. Amiga Demoscene Archive

Testing new ADA - ignore this post!

Author	Message
Raylight Member	#1 - Posted: 25 Mar 2014 13:15 - Edited
...move along... :)

there are no blank lines in the following code section:

        moveq   #$ff,d7
.loop   move.l  (a0)+,(a1)+
        dbra    d7,.loop

Exact replica of this thread below. (i.e. should be identical ascii submitted in this one as the old one)
...
Raylight Member	#2 - Posted: 25 Mar 2014 13:16
1. The simple "clear buffer". Seems obvious now but anyway..
After testing various ways, Clr.l (a0)+ .. Move.l #0,(a0)+ .. 4 * Move.b #0,(a0)+ .. Movem.l .., they all seem more or less equally fast (.b slightly slower)... and move16 is significantly faster. move16 is the way to go, right?

        move16  (.cache_aligned_zeroed_data_in_cache).l,(a0)+

2. Better to read/write longs and split to bytes instead of read/write bytes directly? (sequential accesses)
Getting some mixed results here, and if I remember correctly the 060 handles sequential byte writes pretty well, or? Naturally, this is for cases where there's no easy way to process 4 bytes using longwords directly.

        move.l  (a0)+,d0        ; Perhaps *4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+

or

        move.b  (a0)+,d0        ; Processing scattered here and there
        move.b  (a0)+,d1        ; to be pOEP+sOEP friendly :)
        ...
        move.b  d2,(a1)+
        move.b  d3,(a1)+
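For readers comparing the two access patterns in question 2, here is a minimal sketch in C (chosen only so it can be compiled and checked anywhere; it is not the poster's code and says nothing about 68060 timings). The per-byte lookup table `lut` and the function names `map_bytes` and `map_longs` are hypothetical stand-ins for "processing that cannot be done on a whole longword at once". Both variants produce the same output; they differ only in how memory is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise variant: one 8-bit load and one 8-bit store per element. */
static void map_bytes(const uint8_t *src, uint8_t *dst, size_t n,
                      const uint8_t lut[256])
{
    for (size_t i = 0; i < n; i++)
        dst[i] = lut[src[i]];
}

/* Longword variant: one 32-bit load, split into bytes in registers,
   process each byte, repack, one 32-bit store. n must be a multiple of 4. */
static void map_longs(const uint8_t *src, uint8_t *dst, size_t n,
                      const uint8_t lut[256])
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t in, out;
        memcpy(&in, src + i, 4);                        /* single long read */
        out  = (uint32_t)lut[in >> 24] << 24;           /* the "extra fiddling" */
        out |= (uint32_t)lut[(in >> 16) & 0xff] << 16;
        out |= (uint32_t)lut[(in >> 8) & 0xff] << 8;
        out |= (uint32_t)lut[in & 0xff];
        memcpy(dst + i, &out, 4);                       /* single long write */
    }
}
```

Each byte goes back to the lane it came from, so the result does not depend on the host's byte order. Which variant is faster on a given CPU is exactly the open question in the post and has to be measured.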
Raylight Member	#3 - Posted: 25 Mar 2014 13:19 - Edited by Admin
3. How do you guys handle a "simple" additive buffer blend, r = min(a + b,255), when you need 0-255 values?
My old code uses branches, and I've tried various scc/subx-or constructs with both long and byte r/w, and none of them beats the original version.. hmm.. Well, I do remember optimizing that one for the 060. Anyway the original goes like this:

; In-place, i.e. a = min(a+b,255)
        move.l  (a0),d0         ; src1, dst
        move.l  (a1)+,d1        ; src2
        add.b   d0,d1
        bcc.b   .ok1
        move.b  d2,d1           ; d2 = 255
.ok1    ror.l   #8,d0
        ror.l   #8,d1
        ...                     ; same for the other bytes
.ok4    ror.l   #8,d1
        move.l  d1,(a0)+

Seems like branch cache goes a long way..? My branch-free attempts are significantly slower, and it feels like you'd need an improbable mix of sat/no-sat in the a+b result to get any significant amount of prediction errors.. ?
So how does one do this in 2013? :D Surely there must be a faster way, right?
I'll be using 64 and 128 color versions a lot, but still, the full 256 color blend is required in some cases.
(Thanks Dalton and Blueberry for explaining the "no-overflow" techniques irl and here!)
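One well-known branch-free way to compute r = min(a + b, 255) on four packed bytes is a SWAR ("SIMD within a register") add that builds a saturation mask from the per-byte carries. The sketch below is in C so it can be checked on any machine; it is not the code from this thread, the names `sat_add4` and `blend_add` are made up for illustration, and whether its roughly dozen logic operations beat the branching version on a 68060 is not claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Saturating add of four packed unsigned bytes, without branches. */
static uint32_t sat_add4(uint32_t a, uint32_t b)
{
    const uint32_t LO7 = 0x7f7f7f7fu;   /* low 7 bits of every byte */
    const uint32_t MSB = 0x80808080u;   /* bit 7 of every byte */

    /* Add the low 7 bits; the largest per-byte result is 0xfe, so no
       carry can leak into the neighbouring byte. */
    uint32_t low = (a & LO7) + (b & LO7);

    /* Wrapped 8-bit sum: fold the original top bits back in with XOR. */
    uint32_t sum = low ^ ((a ^ b) & MSB);

    /* Carry out of each byte = majority(a7, b7, carry into bit 7). */
    uint32_t carry = ((a & b) | ((a | b) & low)) & MSB;

    /* Expand 0x80 to 0xff per byte (0x80 - 0x01 = 0x7f, then OR 0x80). */
    uint32_t mask = carry | (carry - (carry >> 7));

    return sum | mask;                  /* overflowed bytes become 255 */
}

/* In-place blend, dst[i] = min(dst[i] + src[i], 255).
   n must be a multiple of 4. */
static void blend_add(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t a, b;
        memcpy(&a, dst + i, 4);
        memcpy(&b, src + i, 4);
        a = sat_add4(a, b);
        memcpy(dst + i, &a, 4);
    }
}
```

For example, `sat_add4(0xff00807f, 0x01ff8001)` gives `0xffffff80`: the first and third bytes overflow and clamp to 255, the second reaches exactly 255, and the last is a plain 0x7f + 0x01. The 64 and 128 color cases mentioned in the post never set the carry, so the mask steps would drop out there.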
rle Admin	#4 - Posted: 25 Mar 2014 21:37 - Edited
testing testing 123

        move.l  (a0)+,d0        ; Perhaps *4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+
Raylight Member	#5 - Posted: 25 Mar 2014 22:40
A. pasted from text with windows EOL Format

        move.l  (a0)+,d0        ; Perhaps 4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+

B. pasted from text with unix EOL Format

        move.l  (a0)+,d0        ; Perhaps 4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+

A.D.A. Amiga Demoscene Archive, Version 3.0