A.D.A. Amiga Demoscene Archive

Amiga Demoscene Archive Forum / Coding / 68060-specific optimising

winden Member	#1 - Posted: 30 May 2006 13:25
I've been revising chapter 10 of the Motorola 68060 book. If any of you can check and confirm I'm reading it correctly, please answer back:

page 10, 10.2.3:
--> if you modify the a0 register, then a (a0,d0.l) addressing mode just after it will stall for 2 cycles
--> if you modify the d0 register, then a (a0,d0.l*1) or (a0,d0.l*4) mode just after it will stall for 2 cycles
--> if you modify the d0 register, then a (a0,d0.l*2) or (a0,d0.l*8) or (a0,d0.w*n) mode just after it will stall for 3 cycles
--> useful exception 1: if the modification was done by a lea, then the stalls don't apply

page 8, 10.1.4:
--> 68060 data cache supports a single operand reference per machine cycle

page 7, deep down:
--> you can execute both a simple instruction and a conditional branch in the same cycle, if the branch is predicted by the branch cache. If the branch is predicted as not-taken, then you can also execute another simple instruction in parallel (so all of 2 insts and a jump get executed)

What I infer from these rules is that a standard 8.16 texturing innerloop should be coded like this:

.loop
        move.w  d0,d5
        move.b  d1,d5
        addx.l  d2,d0
        addx.l  d3,d1
        move.b  (a0,d5.l),(a1)+
        dbf     d7,.loop

so that the addx instructions fill out the stall we would have if we did the mapping just after the move:

.loop
        move.w  d0,d5
        move.b  d1,d5
        move.b  (a0,d5.l),(a1)+ ; 2 cycle stall + 1 cycle read + 1 cycle write
        addx.l  d2,d0
        addx.l  d3,d1
        dbf     d7,.loop

All in all, these pages from the 060 book are written in a not-so-simple to understand manner... too much prose and too few straight facts, I think ;)

krabob Member	#2 - Posted: 31 May 2006 09:22 - Edited
Absolutely fabulous!! Überinteresting!!! I have to verify my loops in karate with that now... (of course I can't confirm anything :-).

Kalms Member	#3 - Posted: 1 Jun 2006 00:33 - Edited
Regarding EA calculations: yes, the base/index register stalls work like that.

Regarding the texturemapper innerloop:
* ADDX is pOEP-only (so it is not a 'simple' instruction) (10.1.2, table 10-2)
* the MOVE.W / MOVE.B pair has the same destination register, so they cannot pair (10.1.6; yes, the rule also applies to write-only "destination" registers)
* the DBF is pOEP-only too (10.1.2, table 10-2), and it will take 1 full cycle to execute (table 10-17)

What you want to do is:
* Split the mem,mem move into one load and one store. Then you get two pOEP|sOEP ops which you can pair with other ops.
* Move the MOVE.W/MOVE.B into separate cycles.
* Use SUBQ/BNE or CMPA/BNE for loop control, because a correctly predicted BNE in the pOEP takes 0 cycles (table 10-17) -- yes, the instruction gets eliminated from the instruction stream and is essentially free if it goes into pOEP!

So you get:

.loop:
        move.b  (a0,d5.l),d6    ; cycle 1 pOEP
        move.w  d0,d5           ; cycle 1 sOEP
        move.b  d6,(a1)+        ; cycle 2 pOEP
        move.b  d1,d5           ; cycle 2 sOEP
        addx.l  d2,d0           ; cycle 3 pOEP+sOEP
        addx.l  d3,d1           ; cycle 4 pOEP+sOEP
        cmpa.l  a1,a2           ; cycle 5 pOEP
                                ; resource conflict on flags -- cycle 5 sOEP is idle
        bne.s   .loop           ; (cycle 6 pOEP, but takes 0 cycles so it's effectively free!)

... all in all 5 cycles, but there is room for one more instruction during cycle 5.

If you can live with less than 16.16 precision then you can go lower than 5 cycles, but the datacache misses begin to dominate down in that territory.

Kalms Member	#4 - Posted: 1 Jun 2006 00:38 - Edited
(corrections folded into previous post)

noname Member	#5 - Posted: 1 Jun 2006 18:16
> I've revising a bit chapter 10 of motorola 68060 book, if any of you can check and confirm I'm reading it correctly, please answer back:

Could somebody please provide me with a link to this "book" (I presume it is available in digital form)?

Kalms Member	#6 - Posted: 1 Jun 2006 18:21
It's called MC68060UM/AD, available somewhere from Motorola's (or Freescale's) site. Alternate location: http://www.depeca.uah.es/docencia/ING-ECA/mm/docu/Microp/Comps/68060UM.pdf

noname Member	#7 - Posted: 1 Jun 2006 18:26
Got it, thanks!

Krishna Member	#8 - Posted: 2 Jun 2006 17:56
> So you get:
>
> .loop:
>         move.b  (a0,d5.l),d6    ; cycle 1 pOEP
>         move.w  d0,d5           ; cycle 1 sOEP
>         move.b  d6,(a1)+        ; cycle 2 pOEP
>         move.b  d1,d5           ; cycle 2 sOEP
>         addx.l  d2,d0           ; cycle 3 pOEP+sOEP
>         addx.l  d3,d1           ; cycle 4 pOEP+sOEP
>         cmpa.l  a1,a2           ; cycle 5 pOEP
>                                 ; resource conflict on flags -- cycle 5 sOEP is idle
>         bne.s   .loop           ; (cycle 6 pOEP, but takes 0 cycles so it's effectively free!)
>
> ... all in all 5 cycles, but there is room for one more instruction during cycle 5. If you can satisfy yourself with less than 16.16 precision then you can go lower than 5 cycles, but the datacache misses begin to dominate down in that territory.

So it's better to have an innerloop working with groups of 4 pixels: you can write longwords and reduce the number of instructions in the loop. I use this kind of loop in the engine :)

If you still want to write pixel per pixel (or to init the engine and work on aligned longword memory), I think it's better to have this innerloop for 16.16 precision:

;d0 = vUuu
;d1 = CCVv with CC = number of pixels to draw
;d3 = #$ffff,deltaVv
.loop:
        move.b  (a0,d5.w),d6    ; cycle 1 pOEP
        move.l  d0,d5           ; cycle 1 sOEP
        move.b  d6,(a1)+        ; cycle 2 pOEP
        move.w  d1,d5           ; cycle 2 sOEP
        lsr.l   #8,d5           ; cycle 3 pOEP
        add.l   d2,d0           ; cycle 3 sOEP
        addx.l  d3,d1           ; cycle 4 pOEP-only
        bgt.s   .loop           ; free if correctly predicted

Kalms Member	#9 - Posted: 2 Jun 2006 21:56 - Edited
The concept is good, but you currently have 2 cycles of AGU stall on d5 during the texel fetch. The easiest way to avoid it is to unroll the loop twice and use two sets of texel-offset registers (currently just d5). Off the top of my head I don't know of any simple way to make the single-pixel loop run in 4 cycles.

Kalms Member	#10 - Posted: 3 Jun 2006 18:09 - Edited
As an aside, here is a 4-cycle loop which has 8.16 precision for U and 8.8 for V:

.pixel
        move.b  (a0,d4.l),d5    ; cycle 1 pOEP
        move.w  d0,d4           ; cycle 1 sOEP
        move.b  d2,d4           ; cycle 2 pOEP
        add.l   d1,d0           ; cycle 2 sOEP
        addx.b  d3,d2           ; cycle 3 pOEP+sOEP
        move.b  d5,(a1)+        ; cycle 4 pOEP
        subq.w  #1,d7           ; cycle 4 sOEP
        bne.s   .pixel          ; free if predicted correctly

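To see what the 8.16 / 8.8 loop above computes, here is a small Python model of its arithmetic. The register layout is an editorial reading of the code, not something the post states: d0 is assumed to hold the U fraction in its upper word and V as 8.8 in its lower word, d2.b the U integer, and d1/d3 the matching deltas. `texel_offsets` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF

def texel_offsets(d0, d2, d1, d3, count):
    """Return the texel offsets the loop would place in d4.w, one per pixel.

    Assumed layout (editorial reading, not from the post):
      d0 = [U fraction, 16 bits][V integer, 8 bits][V fraction, 8 bits]
      d2.b = U integer, d1 = delta for d0, d3.b = U integer delta
    """
    offsets = []
    for _ in range(count):
        # move.w d0,d4 / move.b d2,d4: high byte = V integer, low byte = U integer
        offsets.append((d0 & 0xFF00) | (d2 & 0xFF))
        # add.l d1,d0: the carry out of bit 31 lands in the X flag
        total = d0 + d1
        x_flag = total >> 32
        d0 = total & MASK32
        # addx.b d3,d2: U integer picks up the carry from the U fraction
        d2 = (d2 + d3 + x_flag) & 0xFF
    return offsets

# U advances by 0.5 texels per pixel, V by 1.0 texel per pixel
print([hex(o) for o in texel_offsets(0, 0, 0x80000100, 0, 4)])
```

Note that in this layout a carry out of the V integer would leak into the lowest U fraction bit, which the model reproduces rather than hides.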
Krishna Member	#11 - Posted: 3 Jun 2006 23:19
> you are currently having 2 cycles of AGU stall on d5 during the texel fetch

Where do you see that in the doc? It's a bit hard to find, chapter 10 is very boring to read ^^ If I can find a 68060 board, it will be easy to test, dammit!

Kalms Member	#12 - Posted: 3 Jun 2006 23:42 - Edited
Section 10.2.3, page 10-10. That is what winden's initial post in this forum thread is concerned about.

Your loop is writing to register d5 2 cycles before it is being used in an EA calculation:

        lsr.l   #8,d5
        -- idle cycle
        move.b  (a0,d5.w),d6

According to section 10.2.3, the value in a0 must not be touched in the 2 cycles before the operation with the EA, and the value in d5 must not be touched in the 3 cycles before the operation with the EA. Since there is only one idle cycle, not three, between updating d5 and its use in the EA, two "change/use" penalty cycles will be incurred. That is, both processor pipelines will stall for two cycles before executing the move.b.

PS. don't dis chapter 10, it's the coolest of them all! (yes, I have actually read the manual cover to cover once... it was probably during a very rainy day) DS.

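The change/use arithmetic described in this thread can be written down as a tiny Python helper. This is only a sketch of the rules as quoted here (2 cycles for a base register or a .l*1/.l*4 index, 3 cycles for .l*2, .l*8 or any .w index, no stall after lea); the function name and parameters are hypothetical, and the model simply subtracts the idle cycles between the write and the use.

```python
def change_use_stall(role, size="l", scale=1, idle_cycles=0, via_lea=False):
    """Estimate the stall before an indexed EA, per the rules quoted above.

    role:        "base" (An) or "index" (Dn)
    size, scale: index size ("l" or "w") and scale factor (1, 2, 4, 8)
    idle_cycles: cycles between the register write and the EA use
    via_lea:     True if the register was modified by lea (no stall)
    """
    if via_lea:
        return 0
    if role == "base":
        window = 2
    elif role == "index":
        window = 2 if (size == "l" and scale in (1, 4)) else 3
    else:
        raise ValueError("role must be 'base' or 'index'")
    return max(0, window - idle_cycles)

# lsr.l #8,d5 / one idle cycle / move.b (a0,d5.w),d6
print(change_use_stall("index", size="w", idle_cycles=1))
```

For the loop discussed above, one idle cycle against a 3-cycle window leaves the two penalty cycles mentioned in the post.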
winden Member	#13 - Posted: 4 Jun 2006 08:00
quoting myself:

> page 8, 10.1.4:
>
> --> 68060 data cache supports a single operand reference
> per machine cycle

This one is perhaps the best example of "let's make it sound complicated when it's as simple as ass"... it just means that you can make one datacache access per cycle, either one read or one write. Not one read AND one write, such as a move "(a0,d0.l),(a1)+" ;)

@kalmsen: I must have a routine to calc 2 pixels in parallel somewhere on the harddisk :)

winden Member	#14 - Posted: 3 Jul 2006 22:17
> So it's better to have an innerloop working with group of 4 pixels, you can write longwords and reduce the number of instructions in the loop, I use this kind of loop in the engine :)

Ok, so I must be getting old... how is doing this (8 half-cycles):

        move.b  (a0,d0.w),d7
        lsl.l   #8,d7
        move.b  (a0,d0.w),d7
        lsl.l   #8,d7
        move.b  (a0,d0.w),d7
        lsl.l   #8,d7
        move.b  (a0,d0.w),d7
        move.l  d7,(a1)+

any better than this (8 half-cycles also):

        move.b  (a0,d0.w),d7
        move.b  d7,(a1)+
        move.b  (a0,d0.w),d7
        move.b  d7,(a1)+
        move.b  (a0,d0.w),d7
        move.b  d7,(a1)+
        move.b  (a0,d0.w),d7
        move.b  d7,(a1)+

(please assume that there are instructions between each one so that they all can execute in 0.5 cycles)

It's now commonly understood that the 060 can only access the datacache once per cycle, but I don't really see how substituting memory writes with LSL can help.

We could try composing the "pixel caches" with this scheme for 3 half-cycles:

        move.w  (a0,d0.w),d7
        move.b  (a0,d0.w),d7
        move.w  d7,(a1)+

but then we are risking two things:
1. reading from a non-aligned word can take a 1-cycle stall sometimes
2. writing to a non-aligned word can take a 2-cycle stall (but this is solvable by aligning to an even pixel start outside the innerloop)

(btw, I wonder how fast a 68000-style one-move-per-pixel mapper would really run on 060 ;)

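For readers comparing the two sequences, this Python sketch models what the first one leaves in d7: three `lsl.l #8` steps interleaved with four byte loads build one big-endian longword, so a single `move.l` stores the same four bytes that four `move.b` writes would. `pack_longword` is a hypothetical name, and the model says nothing about timing.

```python
MASK32 = 0xFFFFFFFF

def pack_longword(texels, d7=0xDEADBEEF):
    """Model move.b/lsl.l packing of four texels into d7.

    d7 starts with arbitrary junk to show that the old contents are
    shifted out by the time the fourth byte has been loaded.
    """
    assert len(texels) == 4
    for i, texel in enumerate(texels):
        if i:
            d7 = (d7 << 8) & MASK32                # lsl.l #8,d7
        d7 = (d7 & 0xFFFFFF00) | (texel & 0xFF)    # move.b (a0,d0.w),d7
    return d7

packed = pack_longword([0x11, 0x22, 0x33, 0x44])
# move.l d7,(a1)+ stores big-endian, i.e. the bytes in pixel order
print(packed.to_bytes(4, "big").hex())
```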
Kalms Member	#15 - Posted: 5 Jul 2006 17:55
@winden...

Code sequences 1 & 2: One reason why sequence 1 is better: interleaved memory accesses to different memory regions increase the risk of cache thrashing (one texel fetch might kick out the current pixel-write cacheline, and vice versa). The cache is 4-way to reduce the risk of this happening, but the possibility is still there. On the other hand, sequence 1 requires some extra cycles for loop setup. Depending on the particulars of your situation, this may or may not balance out.

Code sequence 3: Agreed. Sometimes it is faster, sometimes it is slower.

TheDarkCoder Member	#16 - Posted: 6 Jul 2006 11:10
@Winden hi, just to understand this very interesting discussion better:

> (please assume that there are instructions between each one so that they all can execute in 0.5 cycles)

What do you mean by half-cycle? Maybe: the instruction executes in 1 cycle in one of the 060 execution units while another instruction executes in parallel in the second execution unit?

regards

winden Member	#17 - Posted: 6 Jul 2006 20:07
Yes, that's exactly the point... for example "swap d0" would be called a "1 cycle" instruction, but you can emulate "swap d0" with "rol #8,d0; someotherinst; rol #8,d0"... advantage? "rol" is allowed to execute in parallel with more stuff and "swap" is not... so "rol" is better because we can execute another 2 "half-cycle" instructions together with them, thus having a total of 4 half-cycles to use.

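A quick Python check of the value-level equivalence claimed above, assuming the rotates are longword rotates (`rol.l #8,d0`; the post does not give a size). Two 8-bit rotates only reproduce `swap` if nothing else touches d0 in between, and the condition codes are not modelled. `rol32` and `swap` are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF

def rol32(value, count):
    """Rotate a 32-bit value left by count bits (models rol.l #count,dn)."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32

def swap(value):
    """Exchange the upper and lower words (models swap dn)."""
    return ((value << 16) | (value >> 16)) & MASK32

d0 = 0x12345678
print(hex(rol32(rol32(d0, 8), 8)), hex(swap(d0)))
```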
sp_ Member	#18 - Posted: 8 Aug 2006 01:05 - Edited
The fastest linear mapper you can make is by using (infamous) self-modifying code. Remember to add some cache logic to make it possible on 060. For each polygon you simply replace the 0000 with the appropriate offsets (add the interpolating code in between the mem writes to pipeline ;) and interpolate for each 8th/16th or whatever pixel instead of each pixel. Since the mapping is linear, SMC is possible.

.loop8
        move.b  0000(a0),(a1)+
        move.b  0000(a0),(a1)+
        move.b  0000(a0),(a1)+
        move.b  0000(a0),(a1)+
        move.b  0000(a0),(a1)+
        move.b  0000(a0),(a1)+
        move.b  0000(a0),(a1)+
        move.b  0000(a0),(a1)+
        (...)

This is close to 1 instruction per pixel. But the textures jump a little bit.

The next optimization would be to use a small texture which fits in the datacache. With copyback cache enabled this should be almost like writing a constant colour to your screen buffer (as fast as clearing the screen) =)

A fast but not so pretty mapper... but hey, 50 fps is worth it!!

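As an illustration of what "replace the 0000 with the appropriate offsets" could mean, this Python sketch computes the 16-bit displacements for one run of pixels from constant per-polygon deltas. Everything specific here is an assumption for the example: 8.8 fixed-point deltas, a 256-texel-wide texture, and the name `smc_displacements`. The actual patching of the instruction words and the cache handling are not modelled.

```python
def smc_displacements(du, dv, count=8, width=256):
    """Return signed d16 displacements for `count` unrolled moves.

    du, dv: per-pixel texture deltas in 8.8 fixed point (assumed format)
    Each entry is the offset of pixel i relative to the run's start texel.
    """
    displacements = []
    for i in range(count):
        u = (i * du) >> 8          # integer texel step in U after i pixels
        v = (i * dv) >> 8          # integer texel step in V after i pixels
        disp = v * width + u
        if not -32768 <= disp <= 32767:
            raise ValueError("displacement does not fit in 16 bits")
        displacements.append(disp)
    return displacements

# U moves 1.5 texels per pixel, V moves 0.5 texels per pixel
print(smc_displacements(0x180, 0x80, count=4))
```

Because the fractional part of the start position is ignored, the same table is reused for every run, which is one way to read the "textures jump a little bit" remark.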
winden Member	#19 - Posted: 9 Aug 2006 17:51
One-move-per-pixel texturers are rather jumpy... I recall coding a rotozoomer where I just calced 16 X-offsets and then reused them for 20 columns on screen, but obviously it looked really jumpy. So finally, for fast 030 mappers, I decided to draw with 16 or 32 pixel columns but calcing the full 320 X-offsets... this was the technique used for the dual crossfading rotozoomer in Synthesis (sources available now btw :)

For a 3d mapper, I recall this was the main feature of the party 6 demos such as bomb' shaft7 and balance' endolymfa, both of which ran extremely fast with 1x1 resolution on 030.

Adapting this technique for 060 would require, as you say, a cache invalidation to make sure it has no old code inside... probably reserving a few pages aside for these loops and then doing CPUSHP so that all caches for this page are flushed optimally would be the fastest way to gain this speedup. Maybe even enough to run in hires ;) (probably capsule' phase one hires vectors ran with something like that)

Regarding precision, something to keep in mind is that you could trade mapping precision against adding more polygons, but then you have to optimise the polygon setups better.

ps. good to have you back sp :)

sp_ Member	#20 - Posted: 9 Aug 2006 19:18
Good to be back winden :)

You are right, Shaft7 and many other demos used the SMC mapper, especially the older 020 demos. I guess the demos that don't run too well on a 060 use SMC. How is SMC emulated in WinUAE? With JIT?

As for fast texturemappers, I remember I managed to divide the mapper into two passes. The first pass calculated information per line (two polygons normally share a line). To save divs and make it more cache-friendly I calculated all lines. Then the texturemapper (outer and innerloop) fit inside the 030's 256 byte instruction cache. I had a rotating cow with 6-7000 faces at 6-7 fps, 1x1 fullscreen, on a 030 50 MHz.

sp_ Member	#21 - Posted: 14 Aug 2006 20:52 - Edited
Here is my 1997 texturemapper that fits the 256 byte 030 cache. 3 pass 030 blitter c2p. Navigate with the mouse: left / right button zoom out / in, both buttons exit.

For complete source (datafiles are missing) with exe file: Link to source with exe

Here is a modified version which uses self-modifying code to generate the "1 inst per pixel mapper" I described in a previous posting: Link to SMC txturemapper source

SMC innerloop (for 8 pixels) (not optimized):

        jmp     48(pc,d3.w*4)
.indre
        move.b  0000(a6),(a1)+
        move.b  0000(a6),(a1)+
        move.b  0000(a6),(a1)+
        move.b  0000(a6),(a1)+
        move.b  0000(a6),(a1)+
        move.b  0000(a6),(a1)+
        move.b  0000(a6),(a1)+
        move.b  0000(a6),(a1)+
        move.w  d0,d3
        move.b  d1,d3
        lea     txture+$ffff/2,a6
        add.w   d3,a6
        add.l   d6,d0
        addx.l  d4,d1
        bcs.b   .indre
        (....)

Less than 256 bytes (polygonloop, outer, innerloop):

render:
        moveq.l #0,d1                   ; important
        lea     txture+256*256/2,a6
        move.l  a7,.stack
        move.l  #visuallist,.visual
        bra.w   .start
        cnop    0,8
.poly
        addq.l  #4,-(a2)
;       move.l  a4,-(a2)                ;.facen
        lea     (a3,d3.w*4),a3
        movem.w (a3)+,d0-d2
        move.l  2(a1,d0.w*8),d3         ; two start points
        move.l  2(a1,d1.w*8),d4         ;
;------- black-box ------------------- :)
        cmp.w   d3,d4
        beq.b   .oo
        blt.b   .o
        exg.l   d1,d0
        exg.l   d3,d4
.o
        move.l  2(a1,d2.w*8),d3         ; two start points
        exg.l   d0,d2
.oo
;d0 and d1 smallest, d4 is the smallest coords
;-------------------------------------- r
        move.w  10(a1,d0.w*8),a3        ; x slope for the polyfiller
        move.w  10(a1,d1.w*8),a4        ; x slope for the polyfiller
        swap    d2                      ; which line pointer for the next loop
        cmp.w   a3,a4
        bgt.s   .oooo
        beq.w   .start
        exg     d0,d1
        exg     a3,a4                   ; a3 < a4
        exg     d3,d4
.oooo
;a3<a4 belonging to d0 and d1
        moveq.l #1,d5
        move.w  (a1,d0.w*8),d2          ; dy
        move.w  (a1,d1.w*8),d6          ; dy
        cmp.w   d2,d6
        bne.b   .over2
;------------------- flat top fix ---------------------------
        move.l  d3,d5
        sub.l   d4,d5                   ; always sets the gt flag :)
        beq.b   .ooo
        exg.l   a3,a4
        exg.l   d0,d1
;---------------------------------------------------------------
.over2
        bgt.b   .ooo
        move.w  d6,d2
        moveq.l #2,d5                   ; .flagg
.ooo
;d2 is the first dbra
        move.l  d4,d6
        ext.l   d4
        swap    d6
        lsl.l   #8,d4
        add.l   d4,d4
        add.w   d6,d4
        add.l   d4,a0                   ; everything is relative
        move.l  12(a1,d0.w*8),a5        ; texture vector
        move.l  12(a1,d1.w*8),a7        ; texture vector
        sub.l   a5,a7                   ; x slope of texture vector (x.w,y.w)
        move.l  a4,d6
        sub.l   a3,d6
        move.l  a7,d4                   ; x
;       clr.w   d4
        divs.l  d6,d4
        move.w  a7,a7
        move.l  a7,d7                   ; y
        lsl.l   #8,d7
        divs.l  d6,d7
        move.l  d4,d6
        swap    d6
        move.w  d5,d4                   ; flag for the next loop in the high word of s4
        move.w  d7,d6                   ;--
        move.l  6(a1,d0.w*8),a7         ; topmost texture coords
        lsr.l   #8,d5                   ; light box multi shift :)
        sub.l   a2,a2
.entil
        swap    d4
.ytre
        move.l  a2,d0
        move.l  d5,d7
        asr.l   #8,d0
        asr.l   #8,d7
        sub.w   d0,d7
        bmi.b   .ikke
        move.l  a7,d3
        lea     (a0,d0.w),a1            ; screen start
        move.l  a7,d0
        swap    d3
        asr.w   #8,d3                   ; x
.indre
        move.w  d0,d1
        move.b  d3,d1
        add.l   d6,d0
        addx.w  d4,d3
        move.b  (a6,d1.w),(a1)+         ;move.b #$f,(a1)+
        dbf     d7,.indre
.ikke
        add.l   a3,a2                   ; polint start x
        add.l   a4,d5                   ; polint delta
        add.l   a5,a7                   ; texture start x,y int start
        add.w   #512,a0
        subq.w  #1,d2
        bgt.b   .ytre
.over
        move.l  a3,d0
        move.l  a5,d3
        move.l  .linjer2p(pc),a5
        swap    d2                      ; fetch line pointer
        lea     10(a5,d2.w*8),a5
        move.w  -10(a5),d2              ; delta y
        move.w  (a5)+,a3                ; new x slope for the polyfiller
        move.l  (a5)+,a5                ; new texture interpolation
        swap    d4
;-------- black box -------------- ;)
.black
        subq.w  #1,d4                   ; subtract for .tolooper
        beq.b   .entil
        move.l  a3,a4
        move.l  d0,a3
        move.l  d3,a5
        bgt.b   .black
;----------------------------------
.start:
        lea     .kladdp(pc),a2
;       move.l  (a2)+,a0                ;kladd
;       move.l  (a2)+,a1                ;linjer2
;       move.l  (a2)+,a4                ;faces
        movem.l (a2)+,a0/a1/a3/a4
;       move.l  (a2),a4                 ;visuallist
        move.w  (a4),d3
        bpl.w   .poly
;       laglinje\.ut2
.ut
        move.l  .stack,a7
        rts
        cnop    0,4
.stack          dc.l    0
.kladdp         dc.l    kladd
.linjer2p       dc.l    linjer2
.linjefaces     dc.l    0               ;linjefaces
.visual         dc.l    visuallist
.hvilkenlinje:  dc.w    0
.tolooper       dc.w    0
.xaksesaken     dc.l    0
.adresse        dc.l    linjer2

winden Member	#22 - Posted: 7 Sep 2006 18:59 - Edited
Ok, I was just browsing pouet's tp95 entries and stumbled upon the winning 40k intro creep from artwork+polka, with its full-1x1-bumpmapped-glory donut :)

So I got thinking... how far can we push pipelining and inner-loop reversing to help speed up a polygon bumpmapper (with 8bit quantized normals, mind you)? Let's first see the unoptimised case...

        ml d0,d7        ; d0 = UUuuVVvv
        rw #8,d7
        rl #8,d7
        mb a6+d7,d6     ; read bumpmap value
        mb a5+d6,d6     ; read shading table
        mb d6,a4++      ; write screen
        ad a0,d0        ; interpol UV
        cp a3,a4        ; loop
        bl .loop

(sorry for shorthand mnemonics)

So we have got so-called "dependent lookups", the first one to get the bump value from UV and the later one to get the color from BUMPVALUE, each of which is a 2 or 3 cycle stall... what if we run the memory lookups "from down to up"? This would give a lot of cycles between calcing a value and using it as an offset to read memory...

.loop
        move.l  d0,d7
        move.b  d4,(a4)+
        rol.w   #8,d7
        move.b  (a5,d5.l),d4
        rol.l   #8,d7
        move.b  (a6,d6.l),d5
        add.l   a0,d0
        move.w  d7,d6
        cmp.l   a3,a4
        bne.b   .loop

(full mnemonics this time ^^)

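The `rol.w #8` / `rol.l #8` pair in the loops above is what turns `UUuuVVvv` into a `VVUU` table index. The following Python sketch traces the two rotates on one value; `bump_offset` is a hypothetical name, and the model assumes the low word of the result is used as the offset with the upper word ignored or cleared.

```python
MASK32 = 0xFFFFFFFF

def rol_w8(value):
    """rol.w #8,dn: swap the two bytes of the low word, keep the high word."""
    low = value & 0xFFFF
    low = ((low << 8) | (low >> 8)) & 0xFFFF
    return (value & 0xFFFF0000) | low

def rol_l8(value):
    """rol.l #8,dn: rotate the whole longword left by 8 bits."""
    return ((value << 8) | (value >> 24)) & MASK32

def bump_offset(d0):
    """Return the 16-bit VVUU index derived from d0 = UUuuVVvv."""
    d7 = rol_w8(d0)      # UUuuVVvv -> UUuuvvVV
    d7 = rol_l8(d7)      # UUuuvvVV -> uuvvVVUU
    return d7 & 0xFFFF   # low word = VVUU

print(hex(bump_offset(0x12345678)))
```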
sp_ Member	#23 - Posted: 14 Sep 2006 15:41 - Edited
The pipelining looks good, no AGU stalls etc... but... the shadetable read could be removed if you sort the colormap. Let the colour map become the shadetable itself. Then this innerloop can become one-instruction-per-pixel SMC as well.

winden Member	#24 - Posted: 14 Sep 2006 22:50
Holy crap, what a good idea man! Yes, I can now fully see how we just dump the new shaded normals into the hardware palette, thus converting this to a standard txtmap and making it usable with a one-mover... the only cost is sending a new palette to chipmem each frame, but this is surely faster than reading twice as much from memory for each pixel ^^

winden Member	#25 - Posted: 2 Dec 2006 18:39
I'm wondering about this tiled-texture innerloop...

.loop
        move.b  (a0,d6.l),d5
        move.l  d0,d6
        add.l   d2,d0
        or.l    d1,d6
        add.l   d3,d1
        lsr.l   d4,d6
        and.l   #$0003fcfff,d0
        and.l   #$00fc03fff,d1
        move.b  d5,(a1)+
        dbra    d7,.loop

It should run in 5c/pixel with .12 precision, which is a bit slow unless you recognize that it should minimize cache misses... any comments?

sp_ Member	#26 - Posted: 6 Dec 2006 04:31 - Edited
I have analyzed the bitflow of this innerloop.

        and.l   #$0003fcfff,d0  ;%0000 0011 1111 1100 1111 1111 1111
        and.l   #$00fc03fff,d1  ;%1111 1100 0000 0011 1111 1111 1111

y = interpolation1 bits, x = interpolation2 bits, z = interpolation3 bits, - = not used (shifted away)

%0000 00xx xxxx xx00 ---- ---- ---- = d0
%yyyy yy00 0000 00zz zzzz zzzz zzzz = d1

Instead of using two add registers, you can perform the same interpolation in one register.

;%0yy yyyy 0xxx xxxx x0zz zzzz zzzz zzzz

Probably this loop can be shrunk into:

.loop
        add.l   d2,d0
        and.l   d5,d0           ;%011 1111 0111 1111 1011 1111 1111 1111 = $3f7fbfff
        move.l  d0,d6
        lsr.l   d4,d6
        move.b  (a0,d6.l),(a1)+
        dbra    d7,.loop

... But it will require 4 times as much memory for the texture. As for cache, bits 14 and 23 are always 0, so I predict the same cache misses as the previous loop.

Depending on the interpolation data, it might be possible to remove the and, and with it the 4 times memory requirement, if you don't support looping textures or negative texture coordinates. It might require some more work in the outerloop.

.loop
        add.l   d2,d0
        move.l  d0,d6
        lsr.l   d4,d6
        move.b  (a0,d6.l),(a1)+
        dbra    d7,.loop

On 060, to achieve optimal speed the loop should be rearranged for pipelining and to remove the AGU stall. The last loop can be done in SMC as well: 1 inst per pixel, suitable for 000-030.

winden Member	#27 - Posted: 7 Dec 2006 22:52
Your bitflow was OK except for noticing that both x and y are interpolated... in fact for just a second you had fooled me completely :) but... if you try to merge both interpolators into the same register, such as this way:

; d0 XXXX XXX- YYYY YYY- xxxx xxyy yyyy

how do you make the carry from x jump into X without affecting Y? And the carry from y to Y without affecting x? That's the reason for having 2 separate values and the masking on each iteration: before adding the delta to each value, we have to clear the bitgap and then add the delta with all bitgap bits set to 1, so that we carry correctly from the fraction part into the whole part.

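The "clear the bitgap, then add a delta whose bitgap bits are all 1" trick can be checked with a few lines of Python. The layout below is only an example chosen for the sketch (18 low bits, an 8-bit gap, 6 high bits); `scatter`, `gather` and `step` are hypothetical names. A carry out of the low part ripples through the all-ones gap into the high part, while without a carry the gap just fills with ones and is masked off next time round.

```python
MASK32 = 0xFFFFFFFF
FIELD_MASK = 0xFC03FFFF            # bits that belong to the value
GAP = ~FIELD_MASK & MASK32         # 0x03FC0000: the unused bit gap

def scatter(x):
    """Spread a 24-bit value over the gapped layout (low 18 bits, high 6 bits)."""
    return (x & 0x3FFFF) | (((x >> 18) & 0x3F) << 26)

def gather(v):
    """Inverse of scatter: drop the gap and rejoin the two parts."""
    return (v & 0x3FFFF) | ((v >> 26) << 18)

def step(v, delta):
    """One interpolation step: clear the gap, add delta with the gap set to ones."""
    return ((v & FIELD_MASK) + (scatter(delta) | GAP)) & MASK32

v = scatter(0x03FFFF)
print(hex(gather(step(v, 1))))     # carry crosses the gap
```

Adding the plain scattered delta without the gap fill would leave the carry stranded inside the gap, which is the failure the masking in the loop is there to prevent.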
sp_ Member	#28 - Posted: 8 Dec 2006 15:11 - Edited
Yes, you are right. I'll try again :D

Analysis:

.loop
        move.b  (a0,d6.l),d5    ;pOEP
        move.l  d0,d6           ;sOEP 1 cycle
        add.l   d2,d0           ;pOEP
        or.l    d1,d6           ;sOEP 1 cycle
        add.l   d3,d1           ;pOEP
        lsr.l   d4,d6           ;sOEP 1 cycle
        and.l   #$0003fcfff,d0  ;pOEP
        and.l   #$00fc03fff,d1  ;sOEP 1 cycle
        move.b  d5,(a1)+        ;pOEP 1 cycle
        dbra    d7,.loop        ;pOEP+sOEP 1 cycle

6 cycles per pixel. To optimize for 060 you can remove one cycle by replacing the dbra with:

        subq.l  #1,d7           ;sOEP
        bne.b   .loop           ; free if predicted correctly

I tried to make another 12 bit precision (x/y) loop but was unable to beat 5 cycles.

12 bit precision on both x and y:

.loop
        move.b  (a0,d6.w),d5    ;pOEP
        add.l   d2,d0           ;sOEP 000000000000XXXXXXXXxxxxxxxxxxxx
        move.l  d1,d6           ;pOEP
        move.l  d0,d5           ;sOEP
        lsr.l   d3,d5           ;pOEP 000000000000000000000000XXXXXXXX
        lsr.l   #4,d6           ;sOEP 0000looplooploopYYYYYYYYyyyyyyyy
        move.b  d5,d6           ;pOEP 0000looplooploopYYYYYYYYXXXXXXXX
        add.l   d3,d1           ;sOEP looplooploopYYYYYYYYyyyyyyyyyyyy
        move.l  d5,(a1)+
        bcc.b   .loop           ;0 cycles (branch prediction)

winden Member	#29 - Posted: 8 Dec 2006 17:58 - Edited
Hmmm yeh, I should move the loop counter into the higher order bits, I didn't remember that dbra is pOEP-only... btw my main idea was not about precision, but about trying to make a fast tiled mapper, which is not what you used... but anyway, thinking about mixing the loop counter into the interpolation variables, I've come up with something that I've never read nor discussed before. Now I won't claim it's my invention since most probably some coder did that before, but anyhow here it comes...

; d0 UUUUUU-- ------UU uuuuuuuu uuuuuuuu
; d1 ------VV VVVVVV-- vvvvvvvv vvvvvvvv
; d5 -------- -------- UUUUUUVV VVVVVVUU
.loop
        and.l   #%11111100 00000011 11111111 11111111,d0
        and.l   #%00000011 11111100 11111111 11111111,d1
        move.b  (a0,d5.l),d6
        move.l  d0,d5
        move.b  d6,(a1)+
        or.l    d1,d5
        add.l   d3,d1
        lsr.l   d4,d5
        add.l   d2,d0
        bcc.b   .loop

Now, you may notice the last instructions are the U texture interpolation and the branch exit... where is the loop counter then? That's the trick... to use this loop you have to:

1. before the innerloop, calc not only the starting U pos but also the ending U pos
2. adjust the starting U pos so that it crosses the zero point just when the loop should end
3. adjust the texture pointer in the other direction by the same amount

So now the U interpolator holds both the U value and the loop counter at the same time... nice! :)

caveats:
1. obviously you can't let the texture repeat in the U coord, since the repetition point is exactly the trigger for ending the loop.
2. not sure I can use a register and try to overwrite it in the same cycle, but as one of the uses is an EA calculation, AFAIK it's ok to do it...

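A simplified Python model of the "interpolator doubles as loop counter" idea: the loop body runs until `add.l` produces a carry out of bit 31, so biasing the start value decides how many pixels are drawn. This sketch deliberately ignores the split bit layout and the matching texture pointer adjustment from the steps above; it uses a plain 32-bit accumulator, and `biased_start` and `run_until_carry` are hypothetical names.

```python
MASK32 = 0xFFFFFFFF

def biased_start(pixels, delta):
    """Start value that makes the add carry exactly on the last pixel."""
    assert 0 < pixels * delta <= 1 << 32, "run must not wrap more than once"
    return (-pixels * delta) & MASK32

def run_until_carry(start, delta, limit=1 << 20):
    """Count loop-body executions of: <body>; add.l delta,acc; bcc .loop"""
    acc = start
    for count in range(1, limit + 1):
        total = acc + delta          # add.l d2,d0
        acc = total & MASK32
        if total >> 32:              # carry set: bcc falls through
            return count
    raise RuntimeError("no carry within limit")

delta = 0x00012345
print(run_until_carry(biased_start(320, delta), delta))
```

In the real loop the bias would have to be compensated in the texture base address, as step 3 describes; that part is outside this model.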
Blueberry Member	#30 - Posted: 31 Jan 2007 23:43
I did some experiments to examine the index register stall behaviour. The timings given in the manual (and here) are correct, but the list of exceptions is not complete. It turns out that all of the following instructions can be used to modify the index register of a .l or .l*4 index right before the indexing instruction without any stall:

        move.l  #x,dn
        moveq.l #x,dn
        add.l   #x,dn
        addq.l  #x,dn
        add.l   dm,dn
        sub.l   #x,dn
        subq.l  #x,dn
        sub.l   dm,dn
        clr.l   dn

For .l*2, .l*8 and .w (with any scale) the 3 cycle stall occurs in any case.

Of these, the add and sub instructions are certainly the most interesting, since all of the others can be simulated by changing the index instruction.

So... any ideas for a use case for this?
