A.D.A. Amiga Demoscene Archive

  Welcome guest! Please register a new account or log in

  

  

  

Demos Amiga Demoscene Archive Forum / Coding / The wonderful 060... ooohh, where to begin (again)..

 

Author Message
Raylight
Member
#1 - Posted: 5 Feb 2013 13:33
Reply Quote
The wonderful 060... ooohh, where to begin (again)..

Hi!

Long time no me! Thanks to an incredibly kind gesture I now finally have an 060/50 installed! :) :) :) Getting something done in time for DS13 is about as likely as me creating a 50fps-full-overscan-superhires-nine-bpl-zoomrot though, but hey, I could at least try. :D

Now, I've been fiddling with one of my effects from our last demo, trying to optimize the parts (it's like 90% 030-code in there), but have quickly realized how much about 060 "tricks" I've forgotten, and many of my optimizations seem to give worse performance than the old 030-ish code, and some case goes against my intuition.

So.... Anyone care to give me some pointers on some basic typical/common innerloops to give me a head start? Any thoughts on my following rambling would be appreciated hehe. :)

..Lots of stuff.. putting it in a few replies..

(All code assume a typical dbra/bcc loop with some REPT or manual folding etc..)
Raylight
Member
#2 - Posted: 5 Feb 2013 13:34
Reply Quote
1. The simple "clear buffer". Seems obvious now but anyway.. After testing various ways, Clr.l (a0)+ .. Move.l #0,(a0)+ .. 4 * Move.b #0,(a0)+ .. Movem.l .., they all seem more or less equally fast (.b slightly slower)... and move16 is significantly faster. move16 is the way to go, right?

      move16   (.cache_aligned_zeroed_data_in_cache).l,(a0)+



2. Better to read/write longs and split to bytes instead of read/write bytes directly? (sequential accesses) Getting some mixed results here, and if I remember correctly the 060 handles sequential byte writes pretty good, or? Naturally, this is for cases where there's no easy way to process 4 bytes using longwords directly.

      move.l   (a0)+,d0      ; Perhaps *4 to process a whole cache line
... ; Extra fiddling to get the individual bytes..
move.l d1,(a1)+


or

      move.b   (a0)+,d0      ; Processing scattered here and there
move.b (a0)+,d1 ; to be pOEP+sOEP friendly :)
...
move.b d2,(a1)+
move.b d3,(a1)+
Raylight
Member
#3 - Posted: 5 Feb 2013 13:35
Reply Quote
3. How do you guys handle an "simple" additive buffer blend, r = min(a + b,255), when you need 0-255 values?

My old code uses branches, and I've tried various scc/subx-or constructs with both long and byte r/w, and none of the them beats the original version.. hmm.. Well, I do remember optimizing that one for the 060. Anyway the original goes like this:

      ; In-place, i.e. a = min(a+b,255)

move.l (a0),d0 ; src1, dst
move.l (a1)+,d1 ; src2

add.b d0,d1
bcc.b .ok1
move.b d2,d1 ; d2 = 255
.ok1 ror.l #8,d0
ror.l #8,d1

... ; same for the other bytes

.ok4 ror.l #8,d1
move.l #d1,(a0)+


Seems like branch cache goes a long way..? My branch-free attempts are significantly slower, and it feels like you'd need an improbable mix of sat/no-sat in the a+b result to get any significant amount of prediction errors.. ?

So how does one do this in 2013? :D Surely there must be a faster way, right? I'll be using 64 and 128 color versions a lot, but still, the full 256 color blend is required in some cases. (Thanks Dalton and Blueberry for explaining the "no-overflow" techniques irl and here!)


4. The classical offset-table + texture -> screen-with-distorted-texture thingy. :) That is: scr(i) = tex(table(i)). Which, let's say.. is "sometimes" used in 030-demos.. ;) Ok, I know I can come a long way with "grid expansion"/interpolate between table lookups, tiling to increase cache hits, etc.. But in this case I wonder about the "standard" innerloop itself.
On 030, using generated code was afaik the fastest way in simple cases, and on 060 generated code is bad, well that's what I feel and hear at least. Because what really baffles me is that none of my current attempts have been able to beat the generated code approach. Right now I'm testing with rather bad locality of texture reads though (although should get a decent amount of cache-hits), so the situation might change.

      ;a0 = screen
;a1 = texture
;a2 = table with word offsets

move.b xxxx.w(a1),(a2)+ ; 0x10EA xxxx
move.b yyyy.w(a1),(a2)+ ; 0x10EA yyyy
...
move.b qqqq.w(a1),(a2)+ ; num screen pixels of these..


and the naive non-gen-code approach:

      move.w   (a2)+,d0
move.w (a2)+,d1
move.w (a2)+,d2
move.w (a2)+,d3
move.b (a1,d0.w),(a0)+
move.b (a1,d1.w),(a0)+
move.b (a1,d2.w),(a0)+
move.b (a1,d3.w),(a0)+


and then I've been playing with all kinds of variants of something like this:

      move.l   (a2)+,d0      
move.l (a2)+,d1 ; 2 more loads omitted

move.b (a1,d0.w),d3
... ; Fiddle around with swap/ror for index and
... ; shifting etc to merge results into longword.
move.b (a1,d0.w),d4
move.b (a1,d1.w),d5
move.b (a1,d1.w),d6
...
move.l d2,(a0)+ ; 4 bytes eventually end up in a longword.


At best, I've managed to barely match the gen-code-version, but then:
1. it's very sensitive to instruction placement, even changes that seemingly shouldn't affect pairing or # cycles between bus accesses.
2. It uses lot's of registers, which makes it difficult to utilize the free time, I know exists here and there, for something useful... and
3. I just love 060's complexity when it comes to predicting memory accesses. :D

How do you code a loop like this? Going away from gen-code would be great, but then I'd need a simpler loop to be able to add other computations and such.
Raylight
Member
#4 - Posted: 5 Feb 2013 13:36
Reply Quote
5. The simplest blur..
I have this hastily written blur that just samples +/-1 in x and y, and takes the average. As above, the case is the full 0-255 range. While I can think of smarter solutions, they'd require some structural changes, and for now (as I'm relearning the 060) I'm simply interested in optimizing algorithm as is (w/o losing precision and still being an in-place blur). The naive version:

      ; a0,a1 = src and dst. (dst needs a 1-line offset)

move.b 1(a0),d0
move.b -1(a0),d2
add.l d2,d0
move.b 256(a0),d2
add.l d2,d0
move.b -256(a0),d2
add.l d2,d0
lsr.l #2,d0
move.b d0,(a1)+
addq.l #1,a0


...which of course can be optimized a lot by processing several bytes per loop to pair instructions and reuse read data. Doing this however, it gets faster for sure, but nowhere near the speed I know can be achieved. Despite non-sequential reads an almost optimal cache hit-ratio should be possible.

Any hints on approaching a loop like this one? More generally, a loop that samples a neighbourhood, where src buffer == dst buffer, that needs the full 0-255 range, and is suspect to temporary overflow (e.g. byte + byte, then shift..)..?


6. Well, that's a lot already! Cheers!
dalton
Member
#5 - Posted: 5 Feb 2013 18:28
Reply Quote
Congratulations on your new 1260 board!

Regarding question 1, there was a thread about this a while ago: http://ada.untergrund.net/forum/index.php?action=v thread&forum=4&topic=613&page=0

See you at Kelly's tomorrow?
Blueberry
Member
#6 - Posted: 6 Feb 2013 17:13 - Edited
Reply Quote
Yay, a hardcore 060 coding thread. :-D

OK, some suggestions:

1) Yes, move16 is the only instruction which avoids reading in the cache line it is writing to.

2) There are no special penalties for byte writes as far as I know. All aligned writes take 1 cycle, and a write to an uncached line will cause a stall, waiting for the line to be fetched. See this thread for more on memory stalls.

3) Yeah, the branch predictor is pretty efficient when a branch almost always does the same thing. As for other ways to do it, here's an idea:

Let A, B and C denote the most significant bits of our two addition inputs and result, respectively. If A and B are both zero, overflow is not possible. If A and B are both 1, overflow is inevitable. If one of A and B is 0 and the other is 1, overflow can occur, but if it does, C is always 0, and if it doesn't, C is always 1. Thus, we can compute whether overflow has occurred by (A & B) | ((A | B) & !C). We can do this in parallel for all 4 bytes of a longword and then use those bits to correct the overflow into the next byte and to saturate the overflowed bytes.

As far as I can see, the whole computation can be done in 16 instructions, not including the loads and stores. With appropriate pairing, this is 2 cycles per byte. Your version with an add, a predicted branch and two paired rotates is also 2 cycles, but here it is possible to pair the adds with the loads and stores, so that version might still be faster when the number of overflows is very small. But the parallel version is independent of overflows and of course cooler. ;)

4) Ignoring cache misses, the generated version is 2 cycles per pixel. The table-based one can in theory be done in 1.75 cycles per pixel if you load two offsets at a time and store 4 pixels at a time.

Taking cache misses into account, the story is of course very different. It would not surprise me if the instruction prefetcher is much better at hiding memory stalls than the data cache fetcher, making the generated version superior.

When scheduling instruction that use indexed addressing, keep in mind the stall that can arise between changing a register and using it as an index. The delay is 2 cycles for .l and .l*4 and 3 cycles for all other modes. You have to fill in the space between the change and the use with that many cycles worth of instructions to hide the delay, or the indexing will stall.

5) Maybe handle two pixels at a time (alternating between 00ff00ff and ff00ff00 ish)?

6) More, more! :-D
z5_
Member
#7 - Posted: 6 Feb 2013 19:44
Reply Quote
Right, now that these things are sorted, there is still plenty of time to make something for Datastorm :)

Seriously, Raylight, a while ago i wondered what had happened to you since you vanished. Glad you now have an 1260. Hope you keep coding until you release something sometime in the future.

(i know, off topic...)
Raylight
Member
#8 - Posted: 8 Feb 2013 13:22
Reply Quote
@Dalton

Thanks! Interesting thread I've missed in my browsing! My move16 test was similar to the one you posted, except that it was working.. almost.. ;) The "movem thread" mentioned here by Blueberry was a great disquised topic that lead me to it.

@z5____ :D

Sure, this was all the things that was missing for me to finish my 9 minute demo.. ;) Seriously though, yeah I've been away from ADA a long time, and more present IRL. Haven't had the motivation to code anything before I got the 060. I love optimizing stuff and that's pointless in WinUAE. I'll release something for sure! (There's an unfinished demo from 1998 that means a lot to me, so it simply must be done.) The only question is when. :)

PS. off topic in this forums often leads to interesting stuff, or so I've noted.. :P

@Blueberry

Wow, thanks! Lot's of interesting stuff. You get you own reply :)
Raylight
Member
#9 - Posted: 8 Feb 2013 14:50
Reply Quote
@Blueberry

Hardcore 060 is nice.. Maybe we should have a thread for all those things examining the guts of the beast.. :P

(One thing I forgot to mention btw, I'm currently flushing the caches before every single algorithm I posted.)

1) Yes! Although I often think about the write-back mode. I don't think I've ever seen any mentioning about it at all among the threads I've found, excluding MMU page control stuff perhaps, which makes me wonder if every one base their understanding of the 060 cache/memory mechanics on copy-back mode.. or if there's something very very bad about it that I've missed.. hmm..

2) Yup, that's the thread that contained almost all I needed come to the conclusion regarding mem clear. :) So that's in line with what I remembered then. I'd say for *reads* at least, assuming you have things to put it sOEP in between. Originally what I was "sure" about was writes, but re-reading the 060-manual for the 113th time, I'm now a little uncertain how the write buffer plays a role here. I've been under the impression that it'll receive 4 bytes if possible before writing a long. However, the "4-entry" statement and diffuse descriptions here and there now gives me the impression that it'll write entries asap, no matter if they contain a byte or a long... ?

3) This is just too funny. After posting I coded a branch-free version processing longs, using the fact:
A+ B == (A & B) + (A | B) , and then I see your interesting suggestion! :) I was able to (barely) beat the branch-version... and yes, branch-less, or data-independent, is king! :P
Hmm.. yes, interesting, interleaving loads/stores for the branch-version might make it the faster one after all. In general, i find it hard (more complicated inner loop than wanted) to interleave in cases like this, i.e. when loading 2 sources and the next step is to add them, what to put after the load(s) that can run in parallel? Oh, wait! move.l -> Dn is that nice exception. :)

      Move.l   (a0)+,d0      ;pOEP
Move.l (a1)+,d1 ;pOEP
Add.b d1,d0 ;sOEP due to Move.LONG -> Dn


Still, it hard to add d1 to d0 after the first load, the place I want it the most :D, unless I play with "prefect" or alignment.. I suspect I'd need to do an "inter-iteration-interleave-loop" (urrgh..) to make it work. Or make the loop process enough longs per iteration so that the initial unpairable instructions doesn't matter much... Anything I'm missing here? I'd prefer to avoid interactions between iterations where possible as it adds complexity. The general case would be adding 2 buffers for any reason:

.loop
Move.l (a0)+,d0
;Well well well, what to do here.. :P
Move.l (a1)+,d1
Add.l d1,d0
Move.l d0,(a2)+


..hmm, or maybe I found my solution when writing the above. :) Looks like it would work for bytes too. An approach like this assuming you have more stuff to do after the addition.. any takes?

.loop
Moveq.l #0,d1
Move.l (a0)+,d0
Add.l d0,d1
Move.l (a1)+,d2
Add.l d2,d0
Move.l d0,(a2)+


Back to the branch-free approach. I got down to 18 instructions (+ load,store), where 2 are register duplications. I haven't looked for alternative formulations yet, and those 18 instructions have been hard to pair because of the dependency chain in the computation. I was able to pair the void after the loads though. Thanks for your thorough description, I need to translate both into matching boolean expressions first to see if they differ. Although fast, I'm not happy with the dependency chain I have, which usually arises from things like (just out of my head so could maybe be simplified, which was not my intention!):

T1 = (A & K1) ^ (A & K2) ...need a copy of A..
B = A + T1 ...ouch, also need the original A.

..which can be summarized into anything with lots of terms using the same variable, worsened by needing the both original variable and the original expression (which might have been modified for something) at a later time:

R = (A | 2) + (A * K1) + (A & K2) + (...)
B = (R >> 2) + (A >> 3)
...
C = A + R

Well, now I'm just rambling.. :D

4,5,6.. TBC.. :)
Raylight
Member
#10 - Posted: 8 Feb 2013 16:19
Reply Quote
4) Oh yes, those darn AGU-stalls! The 3 cycle one for .w is especially troublesome in this case (and many others).
I remember I figured the generated version would be fast, since:

1. the offset-table won't be in the cache for any case that processed a whole screen.
2. The instruction fetch is 1 long per clock, and the xxxx.w(a0),(a1)+ is 1 long as well.
3. That instruction requires 2 cycles no matter what.
4. One cache line of instructions produces 4 pixels.
5. So for one cache line of instructions we have:
* 16 bytes instruction fetch
* 4 bytes data read (non-sequential)
* 4 bytes data write (sequential)
6. Which seems like a better utilization of the data cache.

I assume three things: The offsets are read once, i.e. no gain from cache reuse. The writes too (in this case) don't have any cache reuse. And the texture data (or the offset table to be exact) could be optimized for cache locality. In my test it isn't though.

So in my case (cache cleared before loop) the generated version is 6 lines read and 1 line write per 16 pixels. The table based would be 3 lines read and 1 line write... hmm.. :P Of course, how much wasted read texture pixels you get looks like a major factor. My test was 5120 pixels, so for the table based version at least 20kb in total would in one way or anther allocate in the cache. Actually I'm still kind of puzzled about the whole thing!

Anyway, worked out your 1.75 figure, i.e. I see that you need 7 load/store, and 6 other instructions. So assuming perfect pairing, keeping AGU-stalls in mind too so.. yeah.. :D I can't easily see an elegant loop here that doesn't use all registers and beyond though. I certainly can't see any such solution that processes 1 long of pixels per iteration. :D ..my near optimal does one cache line at a time, and it ain't pretty!

If you'd like to share and outline of such a loop I'd be happy.. :) It would be interesting to test which 100% texture cache hits (not perfect locality).

5) Oh thanks! That's a really good simple approach I didn't think of. I'll just have to try it out.. :)

6) Ohhh I've got pllenty of more! :D
For now, here's one thing that's bugging me - the number of data cache accesses per cycle. You know that 060-manual-thingy? The one where the details about one little thing is always scattered in like 5 separate places? Well, there are plenty of places which clearly state that the during one clock 060 is capable of:

1 Instruction cache read
1 Data cache read
1 Data cache write
1 Bus *access*
2 Integer instructions
...

   tst.l   (a0)   ; make sure data is in cache
...
move.l (a0),d0 ; pOEP, data cache read
move.l d0,(a0) ; sOEP, data cache write.. right? :P
; Total of 1 cycle... right? :P


On the other hand, there are plenty of places that simply says one data cache access per cycle. (e.g. data operand pipe.) So, I'm curious about the nature of that "simulatenous data cache read and write". Putting scattered hints here and there toghether, I suspect that simultaneous thing refer to something involving the push buffer or so.. but the not knowing really bugs me! ;)

So anyone care to enlighten me?
randy
Member
#11 - Posted: 10 Apr 2013 09:20
Reply Quote
Going off topic here but I just wanted to say a big Yay! for you getting back on the Amiga. I'm very curious to see what kind of marvels you will conjure up next. And that goes for all you Powerliners.

(Nice meeting you at Datastorm by the way)
Raylight
Member
#12 - Posted: 20 Jun 2013 12:59
Reply Quote
randy:
Going off topic here but I just wanted to say a big Yay! for you getting back on the Amiga. I'm very curious to see what kind of marvels you will conjure up next. And that goes for all you Powerliners.

(Nice meeting you at Datastorm by the way)

YAY!!! Thanks! ..and nice to meet you this year as well! Any developments on that cool idea of effect you showed me btw? :) I had some interesting thoughts on ways to implement it on Amiga after I got home.. hmmmmm.... then I might have to steal that one muahahahah ;-)

Quick desperate attempt at off-topic-damage-contro ;) l:

060-tip: scc is an interesting and useful instruction, not just because it a conditional "set", but since in effect, it allows you to to have more control over in which OEP your instructions execute. This can be useful to reduce the complexity of arranging instructions (reducing complexity on 060 is worth a LOT in successful optimization imho).

More specifically, as if it isn't hard enough to try to avoid memory stalls on the 060 architecture, you also need to analyze in which OEP instructions end up - for optimal pairing. This can be very difficult as you often come up with 2 numbers depending on which OEP the "first" instruction goes in. This is where I've found scc useful - it's one of those few pOEP-only-by-allows-sOEP - i.e. it has to go to the pOEP and execution with stall until that OEP is available, and it also allows another instruction to execute simultaneously.

This allows you to predict the pairing of a chain of instructions as long as no memory access occurs. And this is important since you have 1-cycle instructions and addressing modes etc which may only execute in the pOEP. Considering all other resource conflicts you need to handle, this can make things a lot easier in some situations.

There are of course several ways to achieve this but scc has recently become a favorite of mine. :-) If you have a situation where you can solve some part of the problem with scc equally fast, I encourage you to play around a bit with it!

Hmm, "Quick" became quite a post hehe.. :D I'll leave it up to someone else to show something more telling - an example.. ;)

 

  Please register a new account or log in to comment

  

  

  

 

A.D.A. Amiga Demoscene Archive, Version 3.0