A.D.A. Amiga Demoscene Archive

Testing new ADA - ignore this post!

 

Raylight
Member
#1 - Posted: 25 Mar 2014 13:15 - Edited
...move along... :)

there are no blank lines in the following code section:

        moveq   #$ff,d7
.loop
        move.l  (a0)+,(a1)+
        dbra    d7,.loop


Exact replica of this thread below (i.e. the ASCII submitted in this one should be identical to the old one).

...
Raylight
Member
#2 - Posted: 25 Mar 2014 13:16
1. The simple "clear buffer". Seems obvious now, but anyway: after testing various ways (clr.l (a0)+, move.l #0,(a0)+, 4 * move.b #0,(a0)+, movem.l ..), they all seem more or less equally fast (.b slightly slower), while move16 is significantly faster. move16 is the way to go, right?

      move16   (.cache_aligned_zeroed_data_in_cache).l,(a0)+
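
For reference, a minimal sketch of how that line might sit in a complete clear loop. The names ClearBuffer, ZeroLine and BUF_SIZE are made up for illustration, and a few things are assumed rather than known: a 68040/68060, a buffer size that is a multiple of 16 with at most 65536 lines (dbra only counts 16 bits), and both addresses landing on 16-byte boundaries at run time, since move16 transfers whole aligned lines.

```asm
BUF_SIZE        equ     320*256                 ; hypothetical size, multiple of 16

; a0 = buffer to clear, assumed 16-byte aligned
ClearBuffer:
        move.w  #(BUF_SIZE/16)-1,d7             ; one move16 per 16-byte line
.clr    move16  (ZeroLine).l,(a0)+              ; copy 16 zero bytes, a0 += 16
        dbra    d7,.clr
        rts

        cnop    0,16                            ; aligns within the section only;
ZeroLine:                                       ; assumes the section itself is
        dcb.b   16,0                            ; loaded on a 16-byte boundary
```

Whether this actually beats the clr.l/movem.l variants will depend on the memory being written, so treat it as something to measure, not as a given.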



2. Is it better to read/write longs and split them into bytes than to read/write bytes directly (sequential accesses)? I'm getting mixed results here, and if I remember correctly the 060 handles sequential byte writes pretty well, doesn't it? Naturally, this is for cases where there's no easy way to process 4 bytes using longwords directly.

        move.l  (a0)+,d0        ; Perhaps *4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+


or

        move.b  (a0)+,d0        ; Processing scattered here and there
        move.b  (a0)+,d1        ; to be pOEP+sOEP friendly :)
        ...
        move.b  d2,(a1)+
        move.b  d3,(a1)+

Raylight
Member
#3 - Posted: 25 Mar 2014 13:19 - Edited by Admin
3. How do you guys handle a "simple" additive buffer blend, r = min(a + b,255), when you need 0-255 values?

My old code uses branches, and I've tried various scc/subx-or constructs with both long and byte r/w, and none of them beats the original version.. hmm.. Well, I do remember optimizing that one for the 060. Anyway, the original goes like this:

        ; In-place, i.e. a = min(a+b,255)

        move.l  (a0),d0         ; src1, dst
        move.l  (a1)+,d1        ; src2

        add.b   d0,d1
        bcc.b   .ok1
        move.b  d2,d1           ; d2 = 255
.ok1    ror.l   #8,d0
        ror.l   #8,d1

        ...                     ; same for the other bytes

.ok4    ror.l   #8,d1
        move.l  d1,(a0)+


Seems like branch cache goes a long way..? My branch-free attempts are significantly slower, and it feels like you'd need an improbable mix of sat/no-sat in the a+b result to get any significant amount of prediction errors.. ?

So how does one do this in 2013? :D Surely there must be a faster way, right? I'll be using 64 and 128 color versions a lot, but still, the full 256 color blend is required in some cases. (Thanks Dalton and Blueberry for explaining the "no-overflow" techniques irl and here!)
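
For comparison, here is a rough sketch of one branch-free way to do r = min(a + b,255) on all four bytes of a longword at once. It is not the code described above and comes with no speed claim for the 060; the register assignments and the two mask constants in d5/d6 are assumptions of the sketch (note that d2 is used as scratch here, not as the 255 constant). The idea: add the low 7 bits of every byte so nothing can carry into the neighbouring byte, work out the carry out of bit 7 separately, then turn each carry into a $ff mask that is OR'ed onto the wrapped sum.

```asm
        ; Setup, once outside the loop (assumed constants)
        move.l  #$7f7f7f7f,d5
        move.l  #$80808080,d6

        ; Per longword: in-place, a = min(a+b,255) for four bytes
        move.l  (a0),d0         ; a: src1/dst
        move.l  (a1)+,d1        ; b: src2
        move.l  d0,d2
        move.l  d0,d3
        eor.l   d1,d2           ; d2 = a ^ b
        and.l   d1,d3           ; d3 = a & b
        and.l   d5,d0           ; low 7 bits of each byte of a
        and.l   d5,d1           ; low 7 bits of each byte of b
        add.l   d1,d0           ; 7-bit sums, bit 7 = carry into bit 7
        move.l  d2,d4
        and.l   d0,d4           ; (a ^ b) & carry-in
        or.l    d4,d3           ; bit 7 = carry out of each byte
        and.l   d6,d3           ; d3 = $80 in every byte that overflowed
        and.l   d6,d2           ; top bits of a ^ b
        eor.l   d2,d0           ; d0 = wrapped per-byte a + b
        move.l  d3,d1
        lsr.l   #7,d1           ; $01 in every overflowed byte
        move.l  d3,d4
        sub.l   d1,d4           ; $80 - $01 = $7f, no borrow between bytes
        or.l    d3,d4           ; d4 = $ff in every overflowed byte
        or.l    d4,d0           ; saturate those bytes to 255
        move.l  d0,(a0)+        ; write back
```

That is roughly twenty register-only instructions per four pixels, against four well-predicted branches in the original, so it would need timing on real hardware before drawing any conclusion.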
rle
Admin
#4 - Posted: 25 Mar 2014 21:37 - Edited
testing testing 123

        move.l  (a0)+,d0        ; Perhaps *4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+

Raylight
Member
#5 - Posted: 25 Mar 2014 22:40
A. Pasted from text with Windows EOL format

        move.l  (a0)+,d0        ; Perhaps *4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+


B. Pasted from text with Unix EOL format

        move.l  (a0)+,d0        ; Perhaps *4 to process a whole cache line
        ...                     ; Extra fiddling to get the individual bytes..
        move.l  d1,(a1)+

 
