A.D.A. Amiga Demoscene Archive

Forum / Coding / compressed mapping

 

_Jamie_
Member
#1 - Posted: 15 Dec 2008 01:24
I have been trying to optimise the texture-mapping inner loop and tested maybe 20-30 routines for linear and perspective mapping. In the end I will try compressed textures, because I have never seen this done on the Amiga and it could be awesome for big textures.

Below are some of the inner loops I tested. My rule is to keep at least 16 bits of precision for each UV component. Keep in mind that my engine has no overdraw, so I can spend more cycles per pixel.

//----------------------------------------------------
// Classic mapping (4 cycles)
//----------------------------------------------------
// A lot of demos on the A1200 with a 030 must be using this method
//----------------------------------------------------
// d0 ffff00uu 16 bits
// d1 ffffvvff 24 bits
//----------------------------------------------------
addx.l d2,d0
addx.l d3,d1
move.b (a0,d4.l),d5
move.w d1,d4
move.b d5,(a1)+
move.b d0,d4
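
For readers less used to 68k, here is a minimal C sketch of what a classic affine inner loop like the one above computes. It is not a translation of the ADDX trick. The 16.16 fixed-point format, the 256x256 texture size and the names `affine_texel_offset` and `map_span_affine` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of texel (u, v) in a 256x256, 8-bit, row-major texture.
   u and v are 16.16 fixed point; the integer part wraps at 256. */
uint32_t affine_texel_offset(uint32_t u, uint32_t v)
{
    uint32_t ui = (u >> 16) & 0xFFu;   /* integer part of u, 0..255 */
    uint32_t vi = (v >> 16) & 0xFFu;   /* integer part of v, 0..255 */
    return (vi << 8) | ui;
}

/* Draw one span: step u and v linearly and copy one texel per pixel. */
void map_span_affine(uint8_t *dst, size_t count, const uint8_t *texture,
                     uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    assert(dst != NULL && texture != NULL);
    for (size_t i = 0; i < count; i++) {
        dst[i] = texture[affine_texel_offset(u, v)];
        u += du;                        /* 16.16 step per pixel */
        v += dv;
    }
}
```

The assembly versions get their speed from keeping the integer and fractional parts in register positions where the texel offset falls out with almost no extra instructions; the C version only shows the arithmetic.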

//----------------------------------------------------
// Multisize tiled mapping in V (4 cycles)
//----------------------------------------------------
// Multisize is really better for the cache
//----------------------------------------------------
// a0 00uuffff 16 bits
// a1 vvffffff 24 bits
//----------------------------------------------------
add.l a2,a0
add.l a3,a1
move.l a1,d0
and.l d6,d0
move.b (a0,d1.l),d1
add.l a0,d0
move.b d1,(a1)+
lsr.l d7,d0

add.l a2,a0
add.l a3,a1
move.l a1,d1
and.l d6,d1
move.b (a0,d0.l),d0
add.l a0,d1
move.b d0,(a1)+
lsr.l d7,d1

//----------------------------------------------------
// Multisize swizzled tiled (uv) mapping (4.5 cycles)
//----------------------------------------------------
// I use this one for our next demo
//----------------------------------------------------
// d0 u00uffff 16 bits
// d1 0vvfffff 20 bits
//----------------------------------------------------
add.l a0,d0
add.l a1,d1
and.l d6,d0
move.l d1,d3
and.l d7,d3
move.b (a5,d2.l),d2
or.l d0,d3
move.b d2,(a6)+
lsr.l d5,d3

add.l a0,d0
add.l a1,d1
and.l d6,d0
move.l d1,d2
and.l d7,d2
move.b (a5,d3.l),d3
or.l d0,d2
move.b d3,(a6)+
lsr.l d5,d2
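
A minimal C sketch of one tiled layout that matches the register comments above (u split into a high and a low part, with v in between). This is only one reading of those comments, not a confirmed description of the actual engine. The names `tiled_offset` and `retile_texture` and the parameters `strip_bits` and `v_bits` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Texel offset in a texture stored as vertical strips.
   Each strip is (1 << strip_bits) texels wide and (1 << v_bits) texels high,
   so neighbours in v stay close together in memory. */
uint32_t tiled_offset(uint32_t u, uint32_t v, unsigned strip_bits, unsigned v_bits)
{
    uint32_t u_lo = u & ((1u << strip_bits) - 1u);  /* position inside the strip */
    uint32_t u_hi = u >> strip_bits;                /* which strip */
    return (u_hi << (strip_bits + v_bits)) | (v << strip_bits) | u_lo;
}

/* Convert a row-major texture into the strip layout above. */
void retile_texture(const uint8_t *linear, uint8_t *tiled,
                    uint32_t width, unsigned strip_bits, unsigned v_bits)
{
    uint32_t height = 1u << v_bits;
    assert(width % (1u << strip_bits) == 0);        /* whole strips only */
    for (uint32_t v = 0; v < height; v++)
        for (uint32_t u = 0; u < width; u++)
            tiled[tiled_offset(u, v, strip_bits, v_bits)] =
                linear[(size_t)v * width + u];
}
```

With 16-texel strips, one 16-byte cache line holds one strip row, so stepping in v moves to the next line instead of jumping a whole texture row.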

//----------------------------------------------------
// Multisize compressed texture mapping
//----------------------------------------------------
// Maybe I will use this one for big textures, it could
// be really good for the cache. This is the unoptimised
// version (it could be done in 6 cycles) and it has
// never been tested
//----------------------------------------------------
// d0 00uuffff 16 bits
// d1 vvffffff 20 bits
//----------------------------------------------------
add.l a2,a0
add.l a3,a1
move.l a1,d0
and.l d6,d0
add.l a0,d0
lsr.l d7,d0
move.w d0,d1
lsr.w #2,d0
and.b #$3,d1
add.w (a0,d0.l),d1
move.b (a1,d1.l),(a2)+
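
The last few instructions of the compressed loop can be read as a two-level lookup: the texel offset is split into a block number and a position inside a 4-texel block, the block number selects an entry in an index table, and that entry points into a codebook. A minimal C sketch of that reading follows. The name `vq_fetch`, the 16-bit index entries holding codebook byte offsets, and the 4-texel block size are assumptions taken from the untested listing above, not a confirmed format.

```c
#include <stdint.h>

/* Fetch one texel from a block-compressed texture.
   index_map[block] holds the byte offset of that block's 4 texels
   inside the codebook (that is, codebook entry number * 4). */
uint8_t vq_fetch(const uint16_t *index_map, const uint8_t *codebook,
                 uint32_t texel_offset)
{
    uint32_t block  = texel_offset >> 2;   /* which 4-texel block (lsr #2) */
    uint32_t within = texel_offset & 3u;   /* texel inside the block (and #3) */
    return codebook[index_map[block] + within];
}
```

The extra memory read per pixel is the price; the hoped-for gain is that index table plus codebook are much smaller than the raw texture and so stay in the data cache.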
sp_
Member
#2 - Posted: 15 Dec 2008 12:23 - Edited
When you plot two pixels per loop, the instructions pipeline well. Tiled textures will improve cache hits and speed. These inner loops will run fast.

A cache line is 16 bytes. If you plot 16 pixels per loop and place the interpolation after the last write to fast memory, you will get a lot of free cycles (while the bus is writing). This will only be useful for big polygons, or in a perspective-correct mapper that corrects every 16 pixels. Copyback mode must be enabled in the cache.

I have outlined a loop below. I didn't finish it, but I think it's around 5 cycles per pixel.
When the cache is busy writing to fast memory, the 060 CPU can execute many free cycles. On a 50 MHz MC68030, a longword write to fast memory can overlap with 12 cycles of other instructions. I guess a 060 can overlap more. A cache line is 4 longwords, so 12*4 = 48 cycles are free.

The final calculation would be around 5*16 = 80 cycles minus 48 free cycles = 32 cycles per 16 pixels (2 cycles per pixel).
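
The estimate above is simple arithmetic, sketched here in C so the numbers can be varied. The function name `net_cycles_per_cacheline` is hypothetical, and the inputs are the guesses from this post (5 cycles per pixel, 12 free cycles per longword write), not measurements. Later posts in this thread measure far fewer free cycles on writes.

```c
/* One cache line = 16 pixels (bytes) = 4 longwords. */
enum { PIXELS_PER_LINE = 16, LONGWORDS_PER_LINE = 4 };

/* Cycles left per cache line once the work hidden behind bus writes
   is subtracted from the total. Never returns less than zero. */
int net_cycles_per_cacheline(int cycles_per_pixel, int free_cycles_per_longword)
{
    int gross  = cycles_per_pixel * PIXELS_PER_LINE;           /* 5 * 16 = 80 */
    int hidden = free_cycles_per_longword * LONGWORDS_PER_LINE; /* 12 * 4 = 48 */
    int net    = gross - hidden;
    return net < 0 ? 0 : net;
}
```

With the values from the post, `net_cycles_per_cacheline(5, 12)` gives 32, which is 2 cycles per pixel.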

.loop16

add.l a2,a0
add.l a3,a1
move.l a1,d0
and.l d6,d0
add.l a0,d0
lsr.l d7,d0

add.l a2,a0
add.l a3,a1
move.l a1,d1
and.l d6,d1
add.l a0,d1
lsr.l d7,d1

add.l a2,a0
add.l a3,a1
move.l a1,d2
and.l d6,d2
add.l a0,d2
lsr.l d7,d2

add.l a2,a0
add.l a3,a1
move.l a1,d3
and.l d6,d3
add.l a0,d3
lsr.l d7,d3

swap d0
swap d2
move.w d1,d0
move.w d3,d2

add.l a2,a0
add.l a3,a1
move.l a1,d1
and.l d6,d1
add.l a0,d1
lsr.l d7,d1

add.l a2,a0
add.l a3,a1
move.l a1,d3
and.l d6,d3
add.l a0,d3
lsr.l d7,d3

add.l a2,a0
add.l a3,a1
move.l a1,d4
and.l d6,d4
add.l a0,d4
lsr.l d7,d4

add.l a2,a0
add.l a3,a1
move.l a1,d5
and.l d6,d5
add.l a0,d5
lsr.l d7,d5

swap d1
swap d3
move.w d4,d1
move.w d5,d3

(...) 8 more interpolations

move.b (a0,d0.w),(a1)+
swap d0
move.b (a0,d1.w),(a1)+
swap d1
move.b (a0,d2.w),(a1)+
swap d2
move.b (a0,d3.w),(a1)+
swap d3
move.b (a0,d0.w),(a1)+
move.b (a0,d1.w),(a1)+
move.b (a0,d2.w),(a1)+
move.b (a0,d3.w),(a1)+
(...) 8 more writes.
_Jamie_
Member
#3 - Posted: 15 Dec 2008 13:27
move.l a1,d0
and.l d6,d0
add.l a0,d0
lsr.l d7,d0

These four instructions all use the same destination register, so they cannot pair. You need to interleave instructions with different destination registers if you want pairing. Also, move.b (a0,d0.w),(a1)+ is slower than move.b (a0,d0.l),(a1)+ on the 060 (no idea why).
_Jamie_
Member
#4 - Posted: 15 Dec 2008 14:09
I tested running some instructions after a 16-byte write, and it doesn't seem to work. Fuck, it was a good idea.
sp_
Member
#5 - Posted: 15 Dec 2008 14:51 - Edited
Yes, I was too quick when making this loop. With pairing it will use fewer cycles.

a1 must be aligned to a 16-byte boundary, and copyback mode must be enabled in the data cache. It should work (in theory)... but...
_Jamie_
Member
#6 - Posted: 15 Dec 2008 15:01
It was aligned and copyback mode was enabled. Theory and practice are really different :)
sp_
Member
#7 - Posted: 15 Dec 2008 15:44 - Edited
OK. If you disable copyback mode, render 4 pixels per loop and push one longword to memory, it might be faster.

The code is not paired for optimal performance.


.loop4

move.l d0,(a1)+

add.l a3,a1
add.l a2,a0
move.l a1,d0
and.l d6,d0
add.l a0,d0
lsr.l d7,d0

add.l a3,a1
add.l a2,a0

move.l a1,d1
and.l d6,d1
add.l a0,d1
lsr.l d7,d1

add.l a3,a1
add.l a2,a0
move.l a1,d2
and.l d6,d2
add.l a0,d2
lsr.l d7,d2

add.l a3,a1
add.l a2,a0

move.l a1,d3
and.l d6,d3
add.l a0,d3
lsr.l d7,d3

move.w (a0,d0.l),d0
move.w (a0,d2.l),d2
move.b (a0,d1.l),d0
move.b (a0,d3.l),d2
swap d0
move.w d2,d0

loop
_Jamie_
Member
#8 - Posted: 15 Dec 2008 16:15
It's exactly the same result.
_Jamie_
Member
#9 - Posted: 15 Dec 2008 16:17
I mean it doesn't work: you only get one free instruction after one write.
sp_
Member
#10 - Posted: 15 Dec 2008 18:16 - Edited
I think I tested this on a 060 10 years ago. If a memory write takes 1 cycle, you are writing to the cache and not to memory.

Test the following loop. The adds should be "free":

lea fastmem,a0
moveq.l #0,d0

move.l #$ffff,d7
.loop
move.l d0,(a0)+
add.l d1,d2
add.l d3,d4
add.l d5,d6

dbf d7,.loop
_Jamie_
Member
#11 - Posted: 15 Dec 2008 18:30
I can't test right now, but I'm practically sure that you are writing to the cache if your data cache is enabled, so you don't get any free instructions except when your instructions pair.
sp_
Member
#12 - Posted: 16 Dec 2008 01:40 - Edited
The data cache is 8 KB. If your program writes within an 8 KB block, all writes will take 1 cycle. When you start to write outside the block, the cache will start to push cache lines to memory, and your program will slow down.

If d7 is changed to #8192/4, the first run will fill the cache, and the next run will take 1 cycle per write since all the memory is already mapped in the cache.
_Jamie_
Member
#13 - Posted: 16 Dec 2008 01:50
That makes sense. If I read and the cache line is not loaded, I get ~7 free cycles, so why doesn't it work when I write?
_Jamie_
Member
#14 - Posted: 16 Dec 2008 03:28
OK, after some testing I have more precise results:

move.l (a0)+,d0
move.l (a0)+,d1
move.l (a0)+,d2
move.l (a0)+,d3
tst.b (a0)+
rept 10
clr.l d4
clr.l d5
endr

move.l d0,(a0)+
move.l d1,(a0)+
move.l d2,(a0)+
move.l d3,(a0)+
rept 2
clr.l d4
clr.l d5
endr

So 10 free cycles for the read and 2 free cycles for the write.
sp_
Member
#15 - Posted: 16 Dec 2008 08:56 - Edited
Edit:
So this means a 2-cycle delay if the memory is already mapped in the cache, and 10 cycles if not.

How about this loop?

tst.b (a0) ;preload cacheline
rept 10
clr.l d4
clr.l d5
endr
move.l (a0)+,d0
move.l (a0)+,d1
move.l (a0)+,d2
move.l (a0)+,d3


tst.b (a0)

rept 10
clr.l d4
clr.l d5
endr
move.l d0,(a0)+
move.l d1,(a0)+
move.l d2,(a0)+
move.l d3,(a0)+
Kalms
Member
#16 - Posted: 16 Dec 2008 17:37 - Edited
Please, first identify *which* of your read/write instructions will cause actual cache-line reads/writes. Saying that "there are 2 cycles of room after a cacheline write" after you've done four memory writes is a bit inaccurate.

If we take this example code:


; assume that a0 is 16-byte aligned
; assume that we run this loop many times
move.l d0,(a0)+
move.l d1,(a0)+
move.l d2,(a0)+
move.l d3,(a0)+


For each iteration through the loop:
0) If the datacache entry it's writing to is already in the cache, the code will run with a throughput of 1 cycle per instruction. The writes will make that cacheline dirty. Loop-iteration done.

but else...

1) If the cacheline it's writing to is NOT in the cache, a new cacheline will be allocated. (This happens during the first write.)
2a) if the allocated cacheline is non-dirty, no cacheline-write will happen
2b) if the allocated cacheline is dirty, it will be moved to the store buffer and a cacheline-write is enqueued for later
3) If the datacache is busy with a previous transfer, it will first stall until that previous transfer has completed. (It will probably be a cacheline-write triggered by the previous iteration, of which there are about 18 cycles remaining.)
4) A cacheline-read begins. This takes about 4 cycles per longword x 4 longwords = 16 cycles (in practice about 20 cycles on a B1260, with other (unknown) overhead). When this cacheline-read is finished, a cacheline-write will begin if one was queued up during step 2b. (That will take roughly 20 cycles too.)
5) The first MOVE completes after the first longword of the cacheline-read has completed (takes about 5 cycles).
6) The next MOVE attempts to write to the datacache, to a line which is currently being read in. This will stall the execution unit until the cacheline-read is finished (another 15 cycles).
7) The next two MOVEs will be done while the datacache is idle (except for the push buffer perhaps servicing a cacheline-write in the background). They will complete in 1 cycle each.

So. The right place to add padding is between the first and the second MOVE instruction. Or do cache prewarming using a dummy tst.b (a0) a while earlier. You should be able to have roughly 15 cycles of cpu/bus overlap if you do it properly. (Well, at least 10 cycles.)

One thing to notice here is that it is hard to control when *writes* are being done. Actual cacheline-writes are triggered by other reads & writes. The safest way to know is to have a loop which processes much more than 8kB and does no random access; then you can make pretty good statements about which memory accesses will trigger cacheline writes. Otherwise it is better to go by some general rules of thumb and then measure.
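
The bookkeeping in the steps above (hit, allocate, dirty victim pushed out) can be tried out with a small model. The following C sketch counts events only and does not model timing. It assumes the 68060 data cache geometry of 8 KB, 4 ways and 16-byte lines; the round-robin victim choice, the `CacheModel` type and the function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 16, WAYS = 4, SETS = 128 };  /* 4 * 128 * 16 = 8 KB */

typedef struct {
    uint32_t tag[SETS][WAYS];
    bool     valid[SETS][WAYS];
    bool     dirty[SETS][WAYS];
    unsigned next_victim[SETS];   /* round-robin pointer (assumed policy) */
    unsigned long hits, line_fills, line_writebacks;
} CacheModel;

void cache_reset(CacheModel *c) { memset(c, 0, sizeof *c); }

/* Model one longword write in copyback, write-allocate mode. */
void cache_write(CacheModel *c, uint32_t address)
{
    uint32_t line = address / LINE_BYTES;
    unsigned set  = line % SETS;
    uint32_t tag  = line / SETS;

    for (unsigned w = 0; w < WAYS; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->dirty[set][w] = true;      /* case 0: hit, line becomes dirty */
            c->hits++;
            return;
        }
    }
    /* Step 1: miss, allocate a line. Prefer an unused way. */
    unsigned victim = c->next_victim[set];
    for (unsigned w = 0; w < WAYS; w++) {
        if (!c->valid[set][w]) { victim = w; break; }
    }
    if (c->valid[set][victim] && c->dirty[set][victim])
        c->line_writebacks++;             /* step 2b: dirty line queued for write */
    c->next_victim[set] = (victim + 1) % WAYS;
    c->tag[set][victim]   = tag;
    c->valid[set][victim] = true;
    c->dirty[set][victim] = true;
    c->line_fills++;                      /* step 4: cacheline-read */
}

/* Write `bytes` bytes linearly, one longword at a time. */
void cache_write_linear(CacheModel *c, uint32_t start, uint32_t bytes)
{
    for (uint32_t off = 0; off < bytes; off += 4)
        cache_write(c, start + off);
}
```

In this model, a first linear pass over 8 KB causes 512 line fills and no writebacks, a second pass over the same 8 KB is all hits, and moving on to the next 8 KB causes one writeback per newly allocated line.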
_Jamie_
Member
#17 - Posted: 16 Dec 2008 18:09
Hi Kalms,

I tested only with lines that are not in the cache (using a big buffer).

For the read it seems really constant: if your line is not in the data cache, you get 10 free cycles.

move.l (a0)+,d0
move.l (a0)+,d1
move.l (a0)+,d2
move.l (a0)+,d3
tst.b (a0)

Does that look clean to you? 15 cycles seems like a lot to me.

For the write, with the simple example that we used (only 4 writes), I tested all possible cases: between the moves, at the end, at the beginning. I didn't find a way to get more than 2 cycles.

Btw, for mapping it will be hard to control the free cycles during writes.
Kalms
Member
#18 - Posted: 16 Dec 2008 21:19
jamie: for the read, looks clean. 10 cycles is reasonable. For the write, what you report sounds strange, but I don't have my machine accessible this month, so unfortunately I can't do any tests.
_Jamie_
Member
#19 - Posted: 16 Dec 2008 21:31
I hope you are right, it would be cool to have some new free cycles :)
_Jamie_
Member
#20 - Posted: 17 Dec 2008 02:08
OK, I did some more tests.

So it was not 10 cycles for the read but 5 cycles (I lost my mind with my kid :)):

rept 4
move.l (a0)+,d0
move.l (a0)+,d1
move.l (a0)+,d2
move.l (a0)+,d3
tst.b (a0)
rept 5
sub.l d4,d4
sub.l d5,d5
endr
endr

With the free cycles placed after the first move, it's 11 cycles:

rept 4
move.l (a0)+,d0
rept 11
sub.l d4,d4
sub.l d5,d5
endr
move.l (a0)+,d1
move.l (a0)+,d2
move.l (a0)+,d3
endr

For the write it's really strange. I have no accurate result and everything seems weird: the buffer is aligned to 16 bytes and copyback mode is on, but it's always 2 cycles. I have one interrupt running in the background; maybe I will test without the interrupt.
sp_
Member
#21 - Posted: 17 Dec 2008 13:26 - Edited
I have read the post from Kalms. If he is right, the optimal loop would be something like this.
Set x=5 and increase it until the loop no longer runs at copy speed:

; assume that a0 is 16-byte aligned
; assume that we run this loop many times

.loop16

move.l (a0)+,d0 ;A new cacheline is fetched
clr.l d4
REPT x
clr.l d5
clr.l d4
ENDR
move.l (a0)+,d1
move.l (a0)+,d2
move.l (a0)+,d3

move.l d0,(a0)+ ;A new cacheline is fetched
clr.l d5
REPT x
clr.l d4
clr.l d5
ENDR
move.l d1,(a0)+
move.l d2,(a0)+
move.l d3,(a0)+
Boogeyman
Member
#22 - Posted: 17 Dec 2008 13:30
Life was so much simpler on the 68000 :)

Good luck with your new demo, looking forward to seeing it.
sp_
Member
#23 - Posted: 17 Dec 2008 15:02 - Edited
The speed-test loop above, with linear reads from memory, is more suitable for a c2p loop. For texture mapping with big textures, the reads are mostly cache misses. I think you should use a set of inner loops for optimal performance...

The writes are linear in memory, so you will always get free cycles for every 16 bytes written. The reads, however, are difficult to predict.

Use all the free cycles from a cacheline write and spread the rest of the interpolation instructions evenly over the 16 reads.

x = total number of interpolation cycles needed
n = free cycles after a cacheline write
For 16 pixels:
.loop16
write cacheline
interpolate n cycles
read byte 1
interpolate (x-n)/16 cycles
read byte 2
interpolate (x-n)/16 cycles
read byte 3
interpolate (x-n)/16 cycles
...
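
The split described in this outline can be written down as a small helper, sketched here in C. The `InterpSchedule` type and `plan_interpolation` function are hypothetical names; x and n are the quantities defined above, and the remainder field only exists because (x-n) does not always divide evenly by 16.

```c
/* How many interpolation cycles go where in one 16-pixel loop. */
typedef struct {
    int after_write;  /* cycles placed right after the cacheline write */
    int per_read;     /* cycles placed after each of the 16 texel reads */
    int leftover;     /* cycles that did not divide evenly over the reads */
} InterpSchedule;

/* x = total interpolation cycles needed for 16 pixels,
   n = free cycles available after a cacheline write. */
InterpSchedule plan_interpolation(int x, int n)
{
    InterpSchedule s;
    s.after_write = n < x ? n : x;       /* never schedule more than we have */
    int rest = x - s.after_write;
    s.per_read = rest / 16;
    s.leftover = rest % 16;
    return s;
}
```

For example, with x = 80 and n = 10 this gives 10 cycles after the write, 4 after each read, and 6 left over to place by hand.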
_Jamie_
Member
#24 - Posted: 17 Dec 2008 15:31
Actually, I did practically the same trick for my backface culling.

movem.w (a0)+,d0-d3
tst.b (a0)
backface calculation

For the mapping inner loop I will not use these free cycles, because I correct every 8 pixels and I have implemented a working version of the multisize swizzled tiled mapping. Now I dream of implementing compressed textures. The inner loop is easy to do, but finding the best codebook for a group of textures is a real challenge.
_Jamie_
Member
#25 - Posted: 17 Dec 2008 15:42
The story of all this optimisation began when I started to convert the demo to the Amiga (yes, I coded the demo on PC first). I gave texture limitations to my artists, and they used a lot of 128*64 or 64*128 textures. I had one of the first unit tests (a rotozoom) running on the Amiga and I could see that the cache has conflicts when I write the pixels. Fuck, I have 3 scenes with this texture limitation. The only solution I have found so far for better cache use is to compress the textures.
sp_
Member
#26 - Posted: 19 Dec 2008 15:10 - Edited
I hope you make a new demo, Jamie! "When we ride on our enemies" was pretty good, but now with 10 more years of experience I think you can do better. I like the overscan mode used. I think you used it to get the c2p to run with 0 DMA?

I am working on an Amiga 500 demo, but progress is slow. Maybe I will release it at Breakpoint 2009.

(I said that last year too, so don't count on it. ) ;)
_Jamie_
Member
#27 - Posted: 19 Dec 2008 17:42
In the party release it's a classic c2p with slow DMA :) I released a better version just one month later, but it seems no one has this version.

Yep, demos take a lot of time, and with a job and a family it's really hard to finish something.

Hope to see you at Breakpoint with a crazy A500 demo.
_Jamie_
Member
#28 - Posted: 21 Dec 2008 16:24
In the end I am using VQ compression, which was used on the Dreamcast. I get a ratio of 1/2 to 1/3. It will be interesting to see how it works with a complete scene.
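
A rough way to see where such a ratio comes from is to add up the index map and the codebook. The C sketch below does that for 8-bit texels. The thread does not state the block size, index width or codebook size actually used, so the function name `vq_compressed_bytes` and every parameter value in the example are hypothetical.

```c
#include <stddef.h>

/* Size in bytes of a VQ-compressed 8-bit texture:
   one index per block, plus a codebook of `codebook_entries` blocks. */
size_t vq_compressed_bytes(size_t texels, size_t block_texels,
                           size_t index_bytes, size_t codebook_entries)
{
    size_t blocks = (texels + block_texels - 1) / block_texels; /* round up */
    return blocks * index_bytes + codebook_entries * block_texels;
}
```

For a 256x256 texture (65536 bytes raw) with 4-texel blocks, 1-byte indices and 256 codebook entries would give 17408 bytes, while 2-byte indices and 1024 entries would give 36864 bytes. Sharing one codebook across a group of textures spreads the codebook cost over all of them.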

 
