A.D.A. Amiga Demoscene Archive

68060-specific optimising
winden
Member
#1 - Posted: 30 May 2006 13:25
I've been revising chapter 10 of the Motorola 68060 book a bit. If any of you can check and confirm that I'm reading it correctly, please answer back:

page 10, 10.2.3:

--> if you modify the a0 register and then use it right away in a (a0,d0.l) addressing mode, it will stall for 2 cycles

--> if you modify the d0 register and then use it right away in a (a0,d0.l*1) or (a0,d0.l*4) mode, it will stall for 2 cycles

--> if you modify the d0 register and then use it right away in a (a0,d0.l*2), (a0,d0.l*8) or (a0,d0.w*n) mode, it will stall for 3 cycles

--> useful exception 1: if the modification was done by a lea, then the stalls don't apply

page 8, 10.1.4:

--> 68060 data cache supports a single operand reference per machine cycle

page 7, deep down

--> you can execute both a simple instruction and a conditional branch in the same cycle, if the branch is predicted by the branch cache. If the branch is predicted as not-taken, then you can also execute another simple instruction in parallel (so all of 2 insts and a jump get executed)

What I infer from these rules is that a standard 8.16 texturing inner loop should be coded like this:

.loop
move.w d0,d5
move.b d1,d5
addx.l d2,d0
addx.l d3,d1
move.b (a0,d5.l),(a1)+
dbf d7,.loop

so that the addx instructions fill out the stall we would have if we did the mapping just after the move:

.loop
move.w d0,d5
move.b d1,d5
move.b (a0,d5.l),(a1)+ ; 2 cycle stall + 1 cycle read + 1 cycle write
addx.l d2,d0
addx.l d3,d1
dbf d7,.loop


All in all, these pages of the 060 book are written in a not-so-simple-to-understand manner... too much prose and too few straight facts, I think ;)
krabob
Member
#2 - Posted: 31 May 2006 09:22 - Edited
Absolutely fabulous!! Über-interesting!!! I have to verify my loops in karate against that now... (of course I can't confirm anything :-).
Kalms
Member
#3 - Posted: 1 Jun 2006 00:33 - Edited
Regarding EA calculations: yes, the base/index register stalls work like that.

Regarding texturemapper innerloop:

* ADDX is pOEP-only (so it is not a 'simple' instruction) (10.1.2, table 10-2)
* the MOVE.W / MOVE.B pair has the same destination register, so they cannot pair (10.1.6, yes the rule applies also for write-only "destination" registers)
* the DBF is pOEP-only too (10.1.2, table 10-2), and it will take 1 full cycle to execute (table 10-17)

What you want to do is:
* Split the mem,mem move into one load and one store. then you get two pOEP|sOEP ops which you can pair with other ops.
* move the MOVE.W/MOVE.B into separate cycles
* use SUBQ/BNE or CMPA/BNE for loop control, because correctly predicted BNE in the pOEP takes 0 cycles (table 10-17) -- yes, the instruction gets eliminated from the instruction stream and is essentially free if it goes into pOEP!

So you get:

.loop:
move.b (a0,d5.l),d6 ; cycle 1 pOEP
move.w d0,d5 ; cycle 1 sOEP
move.b d6,(a1)+ ; cycle 2 pOEP
move.b d1,d5 ; cycle 2 sOEP
addx.l d2,d0 ; cycle 3 pOEP+sOEP
addx.l d3,d1 ; cycle 4 pOEP+sOEP
cmpa.l a1,a2 ; cycle 5 pOEP
; resource conflict on flags -- cycle 5 sOEP is idle
bne.s .loop ; (cycle 6 pOEP, but takes 0 cycles so it's effectively free!)

... all in all 5 cycles, but there is room for one more instruction during cycle 5. If you can satisfy yourself with less than 16.16 precision then you can go lower than 5 cycles, but the datacache misses begin to dominate down in that territory.
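
A rough way to sanity-check a cycle count like this is to replay the pairing rules mechanically. The sketch below is Python rather than 68k code, and it models only the rules named in this post (pOEP-only instructions issue alone, two instructions writing the same register cannot pair, a correctly predicted branch costs nothing) plus a plain "second instruction needs the first one's result" check. `Op`, `op`, `can_pair` and `count_cycles` are made-up names, and the real 68060 has more rules than this, so treat the result as an estimate.

```python
from typing import NamedTuple

class Op(NamedTuple):
    text: str
    writes: frozenset = frozenset()
    reads: frozenset = frozenset()
    poep_only: bool = False      # e.g. ADDX, DBF (table 10-2, as quoted above)
    free_branch: bool = False    # correctly predicted Bcc: 0 cycles

def op(text, writes="", reads="", **flags):
    """Helper: register lists are given as space-separated strings."""
    return Op(text, frozenset(writes.split()), frozenset(reads.split()), **flags)

def can_pair(first, second):
    """Simplified test: may `second` go into the sOEP next to `first`?"""
    if first.poep_only or second.poep_only or second.free_branch:
        return False
    if first.writes & second.writes:     # same destination register
        return False
    if first.writes & second.reads:      # second depends on first's result
        return False
    return True

def count_cycles(ops):
    """Greedy in-order issue: pair two ops when allowed, else one per cycle."""
    cycles, i = 0, 0
    while i < len(ops):
        if ops[i].free_branch:           # folded out of the instruction stream
            i += 1
            continue
        cycles += 1
        paired = i + 1 < len(ops) and can_pair(ops[i], ops[i + 1])
        i += 2 if paired else 1
    return cycles

kalms_loop = [
    op("move.b (a0,d5.l),d6", "d6", "a0 d5"),
    op("move.w d0,d5", "d5", "d0"),
    op("move.b d6,(a1)+", "a1", "d6 a1"),
    op("move.b d1,d5", "d5", "d1"),
    op("addx.l d2,d0", "d0", "d0 d2", poep_only=True),
    op("addx.l d3,d1", "d1", "d1 d3", poep_only=True),
    op("cmpa.l a1,a2", "", "a1 a2"),
    op("bne.s .loop", free_branch=True),
]
print(count_cycles(kalms_loop))   # 5
```

Feeding it the original `move.w d0,d5` / `move.b d1,d5` pair gives 2 cycles, because both write d5 and so cannot share a cycle.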
Kalms
Member
#4 - Posted: 1 Jun 2006 00:38 - Edited
(corrections folded into previous post)
noname
Member
#5 - Posted: 1 Jun 2006 18:16
> I've been revising chapter 10 of the Motorola 68060 book a bit. If any of you can check and confirm that I'm reading it correctly, please answer back:

Could somebody please provide me with a link to this "book" (I presume it is available in digital form)?
Kalms
Member
#6 - Posted: 1 Jun 2006 18:21
It's called MC68060UM/AD, available somewhere from Motorola's (or Freescale's) site. Alternate location: http://www.depeca.uah.es/docencia/ING-ECA/mm/docu/ Microp/Comps/68060UM.pdf
noname
Member
#7 - Posted: 1 Jun 2006 18:26
Got it, thanks!
Krishna
Member
#8 - Posted: 2 Jun 2006 17:56
> So you get:
>
> .loop:
> move.b (a0,d5.l),d6 ; cycle 1 pOEP
> move.w d0,d5 ; cycle 1 sOEP
> move.b d6,(a1)+ ; cycle 2 pOEP
> move.b d1,d5 ; cycle 2 sOEP
> addx.l d2,d0 ; cycle 3 pOEP+sOEP
> addx.l d3,d1 ; cycle 4 pOEP+sOEP
> cmpa.l a1,a2 ; cycle 5 pOEP
> ; resource conflict on flags -- cycle 5 sOEP is idle
> bne.s .loop ; (cycle 6 pOEP, but takes 0 cycles so it's effectively free!)
>
> ... all in all 5 cycles, but there is room for one more instruction during cycle 5. If you can satisfy yourself with less than 16.16 precision then you can go lower than 5 cycles, but the datacache misses begin to dominate down in that territory.


So it's better to have an inner loop that works on groups of 4 pixels: you can write longwords and reduce the number of instructions in the loop. I use this kind of loop in the engine :)

If you still want to write pixel by pixel (or to init the engine and work on longword-aligned memory), I think it's better to use this inner loop for 16.16 precision:
;d0 = vUuu
;d1 = CCVv with CC = number of pixels to draw
;d3 = #$ffff,deltaVv

.loop:
move.b (a0, d5.w),d6 ; cycle 1 pOEP
move.l d0,d5 ; cycle 1 sOEP
move.b d6,(a1)+ ; cycle 2 pOEP
move.w d1,d5 ; cycle 2 sOEP
lsr.l #8,d5 ; cycle 3 pOEP
add.l d2,d0 ; cycle 3 sOEP
addx.l d3,d1 ; cycle 4 pOEP-only
bgt.s .loop ; free if correctly predicted
Kalms
Member
#9 - Posted: 2 Jun 2006 21:56 - Edited
The concept is good but you are currently having 2 cycles of AGU stall on d5 during the texel fetch. The easiest way to avoid it is to unroll the loop twice and use two sets of texel-offset (currently just d5) registers. Off the top of my head I don't know of any simple way to make the single-pixel loop run in 4 cycles.
Kalms
Member
#10 - Posted: 3 Jun 2006 18:09 - Edited
As an aside, here is a 4-cycle loop which has 8.16 precision for U and 8.8 for V:

.pixel
move.b (a0,d4.l),d5 ; cycle 1 pOEP
move.w d0,d4 ; cycle 1 sOEP
move.b d2,d4 ; cycle 2 pOEP
add.l d1,d0 ; cycle 2 sOEP
addx.b d3,d2 ; cycle 3 pOEP+sOEP
move.b d5,(a1)+ ; cycle 4 pOEP
subq.w #1,d7 ; cycle 4 sOEP
bne.s .pixel ; free if predicted correctly
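
To see where the "8.16 for U, 8.8 for V" split comes from, the register arithmetic of this loop can be replayed in a few lines of Python. The register layout used here is inferred from the instructions, not stated in the post: d0 holds the 16-bit U fraction in its upper word and V as 8.8 in its lower word, d1 holds the matching deltas, and d2.b/d3.b hold the integer part of U and of its delta. `texel_offsets` is a made-up name and the pipelined texel fetch is left out.

```python
MASK32 = 0xFFFFFFFF

def texel_offsets(u_int, u_frac, v_8_8, du_int, du_frac, dv_8_8, count):
    """Replay move.w / move.b / add.l / addx.b for `count` pixels."""
    d0 = ((u_frac << 16) | v_8_8) & MASK32     # U fraction : V as 8.8 (assumed layout)
    d1 = ((du_frac << 16) | dv_8_8) & MASK32   # per-pixel deltas, same layout
    d2 = u_int & 0xFF                          # integer part of U
    d3 = du_int & 0xFF                         # integer part of the U delta
    offsets = []
    for _ in range(count):
        d4 = (d0 & 0xFF00) | d2                # move.w d0,d4 then move.b d2,d4
        offsets.append(d4)                     # texel offset = V_int*256 + U_int
        total = d0 + d1                        # add.l d1,d0
        x_flag = total >> 32                   # carry out of bit 31 sets X
        d0 = total & MASK32
        d2 = (d2 + d3 + x_flag) & 0xFF         # addx.b d3,d2 picks up the U carry
    return offsets

# U steps by 1.5 texels, V by 0.25 texels per pixel
print(texel_offsets(0, 0, 0, 1, 0x8000, 0x0040, 5))
```

Under this layout the carry out of the V addition lands in the lowest bit of the U fraction, which is harmless at 16 fraction bits.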
Krishna
Member
#11 - Posted: 3 Jun 2006 23:19
> you are currently having 2 cycles of AGU stall on d5 during the texel fetch

Where do you see that in the doc? It's a bit hard to find; chapter 10 is very boring to read ^^

If I can find a 68060 board, it will be easy to test, dammit!
Kalms
Member
#12 - Posted: 3 Jun 2006 23:42 - Edited
Section 10.2.3, page 10-10. That is what winden's initial post in this forum thread is concerned about.

Your loop is writing to register d5 2 cycles before it is being used in an EA calculation:

lsr.l #8,d5
-- idle cycle
move.b (a0,d5.w),d6

According to section 10.2.3, the value in a0 must not be touched in the 2 cycles before the operation with the EA, and the value in d5 must not be touched in the 3 cycles before the operation with the EA. Since there is only one idle cycle, not three, between updating d5 and its use in the EA, two "change/use" penalty cycles will be incurred. That is, both processor pipelines will stall for two cycles before executing the move.b.
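
The arithmetic in that paragraph can be written down as a one-line rule: the penalty is whatever part of the "must not touch" window is not covered by the cycles in between. This is a Python sketch of the description given here, not of the manual's wording; the function and constant names are made up.

```python
BASE_WINDOW = 2    # cycles a0 must stay untouched before the EA, per this post
INDEX_WINDOW = 3   # cycles d5 must stay untouched before the EA, per this post

def change_use_stall(window_cycles, idle_cycles):
    """Penalty cycles when a register is changed and then used in an EA."""
    return max(0, window_cycles - idle_cycles)

# lsr.l #8,d5 / one idle cycle / move.b (a0,d5.w),d6
print(change_use_stall(INDEX_WINDOW, 1))
```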

PS. don't dis chapter 10, it's the coolest of them all! (yes, I have actually read the manual cover to cover once... it was probably during a very rainy day) DS.
winden
Member
#13 - Posted: 4 Jun 2006 08:00
quoting myself:

>page 8, 10.1.4:
>
>--> 68060 data cache supports a single operand reference
> per machine cycle

this one is perhaps the best example of "let's make it sound complicated when it's as simple as ass"... it just means that you can make one datacache access per cycle, either one read or one write. Not one read AND one write, such as a move "(a0,d0.l),(a1)+" ;)

@kalmsen: I must have a routine to calc 2 pixels in parallel somewhere on the hard disk :)
winden
Member
#14 - Posted: 3 Jul 2006 22:17
> So it's better to have an inner loop that works on groups of 4 pixels: you can write longwords and reduce the number of instructions in the loop. I use this kind of loop in the engine :)


Ok, so I must be getting old... how is doing this (8 half-cycles):

move.b (a0,d0.w),d7
lsl.l #8,d7
move.b (a0,d0.w),d7
lsl.l #8,d7
move.b (a0,d0.w),d7
lsl.l #8,d7
move.b (a0,d0.w),d7
move.l d7,(a1)+

any better than this (8 half-cycles also):

move.b (a0,d0.w),d7
move.b d7,(a1)+
move.b (a0,d0.w),d7
move.b d7,(a1)+
move.b (a0,d0.w),d7
move.b d7,(a1)+
move.b (a0,d0.w),d7
move.b d7,(a1)+

(please assume that there are instructions between each one so that they all can execute in 0.5 cycles)

It's now commonly understood that the 060 can only access the datacache once per cycle, but I don't really see how substituting memory writes with LSL can help.

We could try composing the "pixel caches" with this scheme for 3 half-cycles:

move.w (a0,d0.w),d7
move.b (a0,d0.w),d7
move.w d7,(a1)+

but then we run two risks:

1. reading from a non-aligned word can take a 1-cycle stall sometimes

2. writing to a non-aligned word can take a 2-cycle stall (but this is solvable by aligning to even pixel start outside innerloop)

(btw, I wonder how fast a 68000-style one-move-per-pixel mapper would really run on the 060 ;)
Kalms
Member
#15 - Posted: 5 Jul 2006 17:55
@winden...


code sequences 1 & 2:

One reason why sequence 1 is better: interleaved memory accesses to different memory regions increase the risk of cache thrashing (one texel fetch might kick out the current pixel-write cacheline, and vice versa). The cache is 4-way to reduce the risk of this happening, but the possibility is still there.

On the other hand, sequence 1 requires some extra cycles for loop setup. Depending on the particulars of your situation, this may or may not balance out.


code sequence 3:

Agreed. Sometimes it is faster, sometimes it is slower.
TheDarkCoder
Member
#16 - Posted: 6 Jul 2006 11:10
@Winden

Hi, just to better understand this very interesting discussion:

> (please assume that there are instructions between each one so that they all can execute in 0.5 cycles)


What do you mean by half-cycle?
Maybe: the instruction executes in 1 cycle in one of the 060 execution units while another instruction executes in parallel in the second execution unit?

regards
winden
Member
#17 - Posted: 6 Jul 2006 20:07
Yes, that's exactly the point... for example "swap d0" would be called a "1 cycle" instruction, but you can emulate "swap d0" with "rol.l #8,d0; someotherinst; rol.l #8,d0"... advantage? "rol" is allowed to execute in parallel with more stuff and "swap" is not... so "rol" is better because we can execute another 2 "half-cycle" instructions together with them, thus having a total of 4 half-cycles to use
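
The swap replacement only works if both rotates are longword-sized, which is easy to check numerically. A minimal Python sketch (the helper names are made up) comparing two 8-bit longword rotates, two 8-bit word rotates, and a swap:

```python
MASK32 = 0xFFFFFFFF

def rol32(value, count):
    """rol.l #count,dn: rotate all 32 bits left."""
    count %= 32
    value &= MASK32
    return ((value << count) | (value >> (32 - count))) & MASK32

def rol16_low(value, count):
    """rol.w #count,dn: rotate only the low word, upper word untouched."""
    count %= 16
    low = value & 0xFFFF
    low = ((low << count) | (low >> (16 - count))) & 0xFFFF
    return (value & 0xFFFF0000) | low

def swap(value):
    """swap dn: exchange the two 16-bit halves."""
    value &= MASK32
    return ((value << 16) | (value >> 16)) & MASK32

d0 = 0x12345678
print(hex(rol32(rol32(d0, 8), 8)), hex(swap(d0)))   # both 0x56781234
print(hex(rol16_low(rol16_low(d0, 8), 8)))          # back to 0x12345678
```

With word-sized rotates the two steps cancel out, so the size suffix matters here.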
sp_
Member
#18 - Posted: 8 Aug 2006 01:05 - Edited
The fastest linear mapper you can make uses (infamous) self-modifying code.
Remember to add some cache handling to make it work on the 060. For each polygon
you simply replace the 0000 with the appropriate offsets.

(add the interpolating code in between the mem writes to pipeline ;)
Interpolate for each 8th/16th or whatever pixel instead of each pixel.
Since the mapping is linear, SMC is possible.


.loop8

move.b 0000(a0),(a1)+
move.b 0000(a0),(a1)+
move.b 0000(a0),(a1)+
move.b 0000(a0),(a1)+
move.b 0000(a0),(a1)+
move.b 0000(a0),(a1)+
move.b 0000(a0),(a1)+
move.b 0000(a0),(a1)+

(...)

This is close to 1 instruction per pixel. But the textures jump a little bit.

The next optimization would be to use a small texture which fits in the datacache. With copyback cache enabled this should be almost like writing a constant colour to your screen buffer (as fast as clearing the screen) =)

A fast but not so pretty mapper.. but hey, 50 fps is worth it!!
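
Why the textures "jump a little bit" can be shown with a small model. The sketch below is Python and is only one reading of the scheme described in this post: the 8 displacements are computed once per polygon (they stand for the patched-in 0000 values), and only the base pointer is re-interpolated every 8 pixels. The 8.8 fixed-point format, the 256-texel row width and all function names are assumptions for illustration.

```python
def exact_offsets(u0, v0, du, dv, count, width=256):
    """Per-pixel interpolation in 8.8 fixed point (no texture wrapping)."""
    return [((v0 + i * dv) >> 8) * width + ((u0 + i * du) >> 8)
            for i in range(count)]

def smc_offsets(u0, v0, du, dv, count, group=8, width=256):
    """Reuse one set of `group` displacements, re-basing every `group` pixels."""
    disp = exact_offsets(0, 0, du, dv, group, width)   # the patched displacements
    out = []
    for start in range(0, count, group):
        base = ((v0 + start * dv) >> 8) * width + ((u0 + start * du) >> 8)
        out.extend(base + d for d in disp[:count - start])
    return out

# whole-texel steps: both methods agree
print(exact_offsets(0, 0, 0x100, 0x200, 16) == smc_offsets(0, 0, 0x100, 0x200, 16))
# fractional step (0.75 texel) from a fractional start (0.5): some texels differ
print(exact_offsets(0x80, 0, 0xC0, 0, 8))
print(smc_offsets(0x80, 0, 0xC0, 0, 8))
```

The mismatch comes from rounding the base and the displacement separately instead of rounding their sum.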
winden
Member
#19 - Posted: 9 Aug 2006 17:51
one-move-per-pixel texturers are rather jumpy...

I recall coding a rotozoomer where I just calced 16 X-offsets and then reused them for 20 columns on screen, but obviously it looked really jumpy. So finally, for fast 030 mappers, I decided to draw in 16 or 32 pixel columns but calc the full 320 X-offsets... this was the technique used for the dual crossfading rotozoomer in Synthesis (sources available now btw :)

For a 3d mapper, I recall this was the main feature of the party 6 demos such as bomb's shaft7 and balance's endolymfa, both of which ran extremely fast at 1x1 resolution on 030.

Adapting this technique for 060 would require, as you say, a cache invalidation to make sure it has no old code inside... probably reserving a few pages for these loops and then doing CPUSHP so that all caches for this page are flushed optimally would be the fastest way to gain this speedup. Maybe even enough to run in hires ;) (probably capsule's phase one hires vectors ran with something like that)

Regarding precision, something to keep in mind is that you could trade mapping precision for more polygons, but then you have to optimise the polygon setup better.

ps. good to have you back sp :)
sp_
Member
#20 - Posted: 9 Aug 2006 19:18
Good to be back winden :)

You are right, Shaft7 and many other demos used the SMC mapper, especially the older 020 demos. I guess the demos that don't run too well on a 060 use SMC. How is SMC emulated in WinUAE? With JIT?

As for fast texture mappers, I remember I managed to divide the mapper into two passes. The first pass calculated information per line (two polygons normally share a line).
To save divs and make it more cache-friendly I calculated all lines.
Then the texture mapper (outer and inner loop) fit inside the 030's 256-byte instruction cache. I had a rotating cow with 6-7000 faces at 6-7 fps, 1x1 fullscreen, on a 50 MHz 030.
sp_
Member
#21 - Posted: 14 Aug 2006 20:52 - Edited
Here is my 1997 texture mapper that fits the 256-byte 030 cache.
3-pass 030 blitter c2p. Navigate with the mouse. Left / right button zooms out / in.
Both buttons exit.

For complete source(datafiles are missing) with exe file:

Link to source with exe

Here is a modified version which uses self-modifying code to generate the "1 inst per pixel mapper" I described in a previous posting.

Link to SMC txturemapper source
SMC innerloop (for 8 pixels)
(Not optimized)

jmp 4*8(pc,d3.w*4)
.indre
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+

move.w d0,d3
move.b d1,d3
lea txture+$ffff/2,a6
add.w d3,a6
add.l d6,d0
addx.l d4,d1
bcs.b .indre


(....)

Less than 256bytes (polygonloop,outer,innerloop).

render:

moveq.l #0,d1 ;important
lea txture+256*256/2,a6
move.l a7,.stack
move.l #visuallist,.visual
bra.w .start
cnop 0,8
.poly
addq.l #4,-(a2)
; move.l a4,-(a2) ;.facen

lea (a3,d3.w*4),a3
movem.w (a3)+,d0-d2

move.l 2(a1,d0.w*8),d3 ; two start points
move.l 2(a1,d1.w*8),d4 ;

;------- black-box ------------------- :)
cmp.w d3,d4
beq.b .oo
blt.b .o
exg.l d1,d0
exg.l d3,d4
.o move.l 2(a1,d2.w*8),d3 ; two start points
exg.l d0,d2
.oo ;d0 and d1 smallest, d4 is the smallest coords
;-------------------------------------- r

move.w 10(a1,d0.w*8),a3 ;x slope for the polyfiller
move.w 10(a1,d1.w*8),a4 ;x slope for the polyfiller

swap d2 ;which line pointer for the next loop

cmp.w a3,a4
bgt.s .oooo
beq.w .start
exg d0,d1
exg a3,a4 ;a3 < a4
exg d3,d4
.oooo
;a3<a4, belonging to d0 and d1
moveq.l #1,d5
move.w (a1,d0.w*8),d2 ;dy
move.w (a1,d1.w*8),d6 ;dy
cmp.w d2,d6
bne.b .over2
;------------------- flat top fix ---------------------------
move.l d3,d5
sub.l d4,d5 ; always sets the gt flag :)
beq.b .ooo
exg.l a3,a4
exg.l d0,d1

;------------------------------------------------- --------------
.over2
bgt.b .ooo
move.w d6,d2
moveq.l #2,d5 ; .flagg
.ooo ;d2 is the first dbra count

move.l d4,d6
ext.l d4
swap d6
lsl.l #8,d4

add.l d4,d4
add.w d6,d4
add.l d4,a0 ;everything is relative

move.l 12(a1,d0.w*8),a5 ;texture vector
move.l 12(a1,d1.w*8),a7 ;texture vector

sub.l a5,a7 ;x slope of texture vector (x.w,y.w)
move.l a4,d6
sub.l a3,d6

move.l a7,d4 ;x
; clr.w d4
divs.l d6,d4

move.w a7,a7
move.l a7,d7 ;y
lsl.l #8,d7
divs.l d6,d7

move.l d4,d6
swap d6

move.w d5,d4 ;flag for the next loop, in the high word of s4

move.w d7,d6
;--
move.l 6(a1,d0.w*8),a7 ;topmost texture coords
lsr.l #8,d5 ;light box multi shift :)
sub.l a2,a2
.entil
swap d4
.ytre
move.l a2,d0
move.l d5,d7
asr.l #8,d0
asr.l #8,d7
sub.w d0,d7
bmi.b .ikke
move.l a7,d3
lea (a0,d0.w),a1 ;screen start

move.l a7,d0
swap d3
asr.w #8,d3 ;x

.indre
move.w d0,d1
move.b d3,d1
add.l d6,d0
addx.w d4,d3
move.b (a6,d1.w),(a1)+ ;move.b #$f,(a1)+
dbf d7,.indre
.ikke
add.l a3,a2 ;polint start x
add.l a4,d5 ;polint delta
add.l a5,a7 ;Txturestart x,Y int start

add.w #512,a0
subq.w #1,d2
bgt.b .ytre

.over
move.l a3,d0
move.l a5,d3

move.l .linjer2p(pc),a5
swap d2 ;fetch line pointer

lea 10(a5,d2.w*8),a5
move.w -10(a5),d2 ;deltay

move.w (a5)+,a3 ;new x slope for the polyfiller
move.l (a5)+,a5 ;new texture interpolator

swap d4
;-------- black box -------------- ;)
.black
subq.w #1,d4 ;sub for .tolooper
beq.b .entil

move.l a3,a4
move.l d0,a3

move.l d3,a5
bgt.b .black

;----------------------------------
.start:
lea .kladdp(pc),a2
; move.l (a2)+,a0 ;kladd
; move.l (a2)+,a1 ;linjer2
; move.l (a2)+,a4 ;faces

movem.l (a2)+,a0/a1/a3/a4

; move.l (a2),a4 ;visuallist

move.w (a4),d3
bpl.w .poly ; laglinje\.ut2
.ut
move.l .stack,a7

rts

cnop 0,4
.stack dc.l 0
.kladdp dc.l kladd
.linjer2p dc.l linjer2
.linjefaces dc.l 0 ;linjefaces
.visual dc.l visuallist
.hvilkenlinje: dc.w 0
.tolooper dc.w 0
.xaksesaken dc.l 0
.adresse dc.l linjer2
winden
Member
#22 - Posted: 7 Sep 2006 18:59 - Edited
Ok, I was just browsing pouet's tp95 entries and gazed upon the winning 40k intro creep from artwork+polka, with its full-1x1-bumpmapped-glory donut :)

So I got thinking... how far can we push pipelining and inner-loop reversing to help speed up a polygon bumpmapper (with 8bit quantized normals, mind you)?

Let's see first the unoptimised case...

ml d0,d7 ; d0 = UUuuVVvv
rw #8,d7
rl #8,d7
mb a6+d7,d6 ; read bumpmap value
mb a5+d6,d6 ; read shading table
mb d6,a4++ ; write screen
ad a0,d0 ; interpol UV
cp a3,a4 ; loop
bl .loop

(sorry for shorthand mnemonics)

So we have got so-called "dependent lookups", first one to get the bump value from UV and later one to get the color from BUMPVALUE, each of which costs a 2 or 3 cycle stall... what if we run the memory lookups "from down to up"?... this would give a lot of cycles between calcing a value and using it as an offset to read memory...


move.l d0,d7
move.b d4,(a4)+
rol.w #8,d7
move.b (a5,d5.l),d4
rol.l #8,d7
move.b (a6,d6.l),d5
add.l a0,d0
move.w d7,d6
cmp.l a3,a4
bne.b .loop

(full mnemonics this time ^^)
sp_
Member
#23 - Posted: 14 Sep 2006 15:41 - Edited
The pipelining looks good, no AGU stalls etc... but...

The shadetable read could be removed if you sort the colormap.
Let the colour map become the shadetable itself.
Then this inner loop can become one-instruction-per-pixel SMC as well.
winden
Member
#24 - Posted: 14 Sep 2006 22:50
holy crap, what a good idea man! yes, I can now fully see how we just dump the new shaded normals into the hardware palette, thus converting this to a standard txtmap and thus usable with a one-mover... the only cost is sending a new palette to chipmem each frame, but this is surely faster than reading twice as much from memory for each pixel ^^
winden
Member
#25 - Posted: 2 Dec 2006 18:39
I'm wondering about this tiled-texture innerloop...

.loop
move.b (a0,d6.l),d5
move.l d0,d6

add.l d2,d0
or.l d1,d6

add.l d3,d1
lsr.l d4,d6

and.l #$0003fcfff,d0
and.l #$00fc03fff,d1

move.b d5,(a1)+
dbra d7,.loop


should run in 5c/pixel with .12 precision, which is a bit slow unless you recognize that it should minimize cache misses... any comments?
sp_
Member
#26 - Posted: 6 Dec 2006 04:31 - Edited
I have analyzed the bitflow of this innerloop.

and.l #$0003fcfff,d0 ;%0000 0011 1111 1100 1111 1111 1111
and.l #$00fc03fff,d1 ;%1111 1100 0000 0011 1111 1111 1111

y=interpolation1 bits, x interpolation2 bits, z interpolation3 bits, - not used(shifted away)

%0000 00xx xxxx xx00 ---- ---- ---- = d0
%yyyy yy00 0000 00zz zzzz zzzz zzzz = d1

Instead of using two add registers, you can perform the same interpolation in one register.

;%0yy yyyy 0xxx xxxx x0zz zzzz zzzz zzzz

Probably this loop can be shrunk to:

.loop
add.l d2,d0
and.l d5,d0 ;%011 1111 0111 1111 1011 1111 1111 1111 = $3f7fbfff
move.l d0,d6
lsr.l d4,d6
move.b (a0,d6.l),(a1)+

dbra d7,.loop

...
But it will require 4 times as much memory for the texture. As for cache, bits 14 and 23 are always 0, so I predict the same cache misses as for the previous loop.

Depending on the interpolation data it might be possible to remove the and, and the 4 times memory needed, if you don't support looping textures or negative texture coordinates. It might require some more work in the outer loop.

.loop
add.l d2,d0
move.l d0,d6
lsr.l d4,d6
move.b (a0,d6.l),(a1)+

dbra d7,.loop

On 060, to achieve optimal speed the loop should be rearranged for pipelining and to remove the AGU stall.

The last loop can be done as SMC as well: 1 inst per pixel, suitable for 000-030.
winden
Member
#27 - Posted: 7 Dec 2006 22:52
Your bitflow was OK except for noticing that both x and y are interpolated... in fact for just a second you had fooled me completely :)

but... if you try to merge both interpolators into the same register such as this way:

; d0 XXXX XXX- YYYY YYY- xxxx xxyy yyyy

how do you make the carry from x jump into X without affecting Y? and the carry from y to Y without affecting x?

That's the reason for having 2 separate values and the masking on each iteration... before adding the delta to each value, we have to clear the bitgap and then add the delta with all bitgap bits set to 1, so that we carry correctly from the fraction part into the whole part.
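
The carry-across-the-gap trick can be checked with plain integers. This Python sketch uses the d1 layout implied by the `$00fc03fff` mask in the tiled loop (6 integer bits, an 8-bit gap, then 2 integer bits and 12 fraction bits); `pack`, `unpack` and `step` are made-up helper names, and the value is treated as a plain 8.12 number.

```python
GAP = 0x003FC000        # bits 21..14: the hole where the other coordinate goes
FIELDS = 0x0FC03FFF     # the and.l mask: everything that belongs to this value

def pack(value_8_12):
    """Spread a 20-bit 8.12 value around the gap."""
    low = value_8_12 & 0x3FFF             # 12 fraction bits + low 2 integer bits
    high = (value_8_12 >> 14) & 0x3F      # top 6 integer bits
    return (high << 22) | low

def unpack(packed):
    """Collect the two fields back into one 8.12 value."""
    return (((packed >> 22) & 0x3F) << 14) | (packed & 0x3FFF)

def step(acc, delta_8_12, fill_gap=True):
    """One iteration: clear the gap, then add a delta whose gap bits are all 1."""
    acc &= FIELDS                                      # and.l #mask,d1
    delta = pack(delta_8_12) | (GAP if fill_gap else 0)
    return (acc + delta) & 0xFFFFFFFF                  # add.l d3,d1

acc = pack(0x3800)                                     # 3.5
print(hex(unpack(step(acc, 0x0800))))                  # 0x4000, i.e. 4.0
print(hex(unpack(step(acc, 0x0800, fill_gap=False))))  # carry lost in the gap
```

With the gap bits of the delta set, a carry out of the low field ripples through the ones and arrives in the high field; without them it just sets the lowest gap bit and is masked away.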
sp_
Member
#28 - Posted: 8 Dec 2006 15:11 - Edited
Yes, you are right.
I'll try again :D

Analysis:

.loop
move.b (a0,d6.l),d5 ;pOEP
move.l d0,d6 ;sOEP 1 cycle

add.l d2,d0 ;pOEP
or.l d1,d6 ;sOEP 1 cycle

add.l d3,d1 ;pOEP
lsr.l d4,d6 ;sOEP 1 cycle

and.l #$0003fcfff,d0 ;pOEP
and.l #$00fc03fff,d1 ;sOEP 1 cycle

move.b d5,(a1)+ ;pOEP 1 cycle
dbra d7,.loop ;pOEP+sOEP 1 cycle

6 cycles per pixel.

As a 060 optimization you can remove one cycle by replacing the dbra with:

subq.l #1,d7 ;sOEP
bne.b .loop ; free if predicted correctly

I tried to make another 12-bit precision (x/y) loop but was unable to beat 5 cycles.

12-bit precision on both x and y:

.loop
move.b (a0,d6.w),d5 ;pOEP
add.l d2,d0 ;sOEP 000000000000XXXXXXXXxxxxxxxxxxxx

move.l d1,d6 ;pOEP
move.l d0,d5 ;sOEP

lsr.l d3,d5 ;pOEP 000000000000000000000000XXXXXXXX
lsr.l #4,d6 ;sOEP 0000looplooploopYYYYYYYYyyyyyyyy

move.b d5,d6 ;pOEP 0000looplooploopYYYYYYYYXXXXXXXX
add.l d3,d1 ;sOEP looplooploopYYYYYYYYyyyyyyyyyyyy

move.l d5,(a1)+
bcc.b .loop ;0 cycles (branch prediction)
winden
Member
#29 - Posted: 8 Dec 2006 17:58 - Edited
Hmmm, yeah, I should move the loop counter into the higher-order bits; I didn't remember that dbra is pOEP-only...

btw my main idea was not about precision, but about trying to make a fast tiled mapper, which is not what you used... but anyway, thinking about mixing the loop counter into the interpolation variables, I've come up with something that I've never read about nor discussed before. Now I won't claim it's my invention, since most probably some coder did that before, but anyhow here it comes...

; d0 UUUUUU-- ------UU uuuuuuuu uuuuuuuu
; d1 ------VV VVVVVV-- vvvvvvvv vvvvvvvv
; d5 -------- -------- UUUUUUVV VVVVVVUU

.loop
and.l #$fc03ffff,d0 ; %11111100 00000011 11111111 11111111
and.l #$03fcffff,d1 ; %00000011 11111100 11111111 11111111

move.b (a0,d5.l),d6
move.l d0,d5

move.b d6,(a1)+
or.l d1,d5

add.l d3,d1
lsr.l d4,d5

add.l d2,d0
bcc.b .loop


now, you may notice the last instructions are the U texture interpolation and the branch exit... where is the loop counter then? that's the trick... for using this loop you have to:

1. before the inner loop, calc not only the starting U pos but also the ending U pos

2. adjust the starting U pos so that it crosses the zero point just when the loop should end

3. adjust texture pointer into the other direction by the same amount

so now U interpolator has at the same time the U value and the loop counter... nice! :)

caveats:

1. obviously you can't let the texture repeat in U coord, since the repetition point is exactly the trigger for ending the loop.

2. I'm not sure I can use a register and overwrite it in the same cycle, but since one of the two is an EA calculation, AFAIK it's OK to do it...
Blueberry
Member
#30 - Posted: 31 Jan 2007 23:43
I did some experiments to examine the index register stall behaviour. The timings given in the manual (and here) are correct, but the list of exceptions is not complete. It turns out that all of the following instructions can be used to modify the index register of .l or .l*4 right before the indexing instruction without any stall:

move.l #x,dn
moveq.l #x,dn
add.l #x,dn
addq.l #x,dn
add.l dm,dn
sub.l #x,dn
subq.l #x,dn
sub.l dm,dn
clr.l dn

for .l*2, .l*8 and .w (with any scale) the 3 cycle stall occurs in any case.

Of these, the add and sub instructions are certainly the most interesting, since all of the others can be simulated by changing the index instruction. So... any ideas for a use case for this?
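
For quick reference while scheduling a loop, these findings fit into a small lookup. The Python sketch below encodes only what is reported in this thread (2 cycles for .l and .l*4, 3 cycles for .l*2, .l*8 and .w, and the list of writers that cause no stall); `NO_STALL_WRITERS` and `index_stall` are made-up names and the instruction strings are just labels, not parsed assembly.

```python
# Writers reported above as stall-free before a .l or .l*4 index use
NO_STALL_WRITERS = {
    "move.l #x,dn", "moveq.l #x,dn", "add.l #x,dn", "addq.l #x,dn",
    "add.l dm,dn", "sub.l #x,dn", "subq.l #x,dn", "sub.l dm,dn", "clr.l dn",
}

def index_stall(writer, size, scale):
    """Stall cycles when `writer` modifies the index register right before use."""
    if size == "l" and scale in (1, 4):
        return 0 if writer in NO_STALL_WRITERS else 2
    return 3   # .l*2, .l*8 and .w with any scale stall in any case

print(index_stall("add.l dm,dn", "l", 1))   # 0
print(index_stall("lsr.l #8,dn", "l", 4))   # 2
print(index_stall("add.l dm,dn", "w", 1))   # 3
```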

 


A.D.A. Amiga Demoscene Archive, Version 3.0