A.D.A. Amiga Demoscene Archive

Forum / Coding / compressed mapping

_Jamie_ Member #1 - Posted: 15 Dec 2008 01:24

I have been trying to find a way to optimise the mapping inner loop, and I tested maybe 20-30 routines for linear and perspective mapping. In the end I will try compressed textures, because I have never seen this on the Amiga and it could be awesome for big textures. Below are some of the inner loops I tested. My rule is to keep a minimum of 16 bits of precision for the u and v components. Keep in mind that my engine has no overdraw, so I can spend more cycles per pixel.

//----------------------------------------------------
// Classic mapping (4 cycles)
//----------------------------------------------------
// A lot of demos must use this method on the 1200 with 030
//----------------------------------------------------
// d0 ffff00uu 16 bits
// d1 ffffvvff 24 bits
//----------------------------------------------------
    addx.l  d2,d0
    addx.l  d3,d1
    move.b  (a0,d4.l),d5
    move.w  d1,d4
    move.b  d5,(a1)+
    move.b  d0,d4

//----------------------------------------------------
// Multisize tiled mapping in V (4 cycles)
//----------------------------------------------------
// Multisize is really better for the cache
//----------------------------------------------------
// a0 00uuffff 16 bits
// a1 vvffffff 24 bits
//----------------------------------------------------
    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d0
    and.l   d6,d0
    move.b  (a0,d1.l),d1
    add.l   a0,d0
    move.b  d1,(a1)+
    lsr.l   d7,d0

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d1
    and.l   d6,d1
    move.b  (a0,d0.l),d0
    add.l   a0,d1
    move.b  d0,(a1)+
    lsr.l   d7,d1

//----------------------------------------------------
// Multisize swizzled tiled (uv) mapping (4.5 cycles)
//----------------------------------------------------
// I use this one for our next demo
//----------------------------------------------------
// d0 u00uffff 16 bits
// d1 0vvfffff 20 bits
//----------------------------------------------------
    add.l   a0,d0
    add.l   a1,d1
    and.l   d6,d0
    move.l  d1,d3
    and.l   d7,d3
    move.b  (a5,d2.l),d2
    or.l    d0,d3
    move.b  d2,(a6)+
    lsr.l   d5,d3

    add.l   a0,d0
    add.l   a1,d1
    and.l   d6,d0
    move.l  d1,d2
    and.l   d7,d2
    move.b  (a5,d3.l),d3
    or.l    d0,d2
    move.b  d3,(a6)+
    lsr.l   d5,d2

//----------------------------------------------------
// Multisize compressed texture mapping
//----------------------------------------------------
// Maybe I will use this one for big textures; it could be
// really cool for the cache. This is the unoptimised
// version (it could be done in 6 cycles) and it has never
// been tested
//----------------------------------------------------
// d0 00uuffff 16 bits
// d1 vvffffff 20 bits
//----------------------------------------------------
    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d0
    and.l   d6,d0
    add.l   a0,d0
    lsr.l   d7,d0
    move.w  d0,d1
    lsr.w   #2,d0
    and.b   #$3,d1
    add.w   (a0,d0.l),d1
    move.b  (a1,d1.l),(a2)+
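As a rough C model of what the last few instructions of the compressed loop appear to do, here is a minimal sketch. It assumes the texture is cut into blocks of 4 consecutive texels and that a per-block table stores the offset of each block's 4-texel codeword in a codebook. The names `vq_fetch`, `block_offsets` and `codebook` are hypothetical, and the layout is an interpretation of the untested assembly above, not a confirmed description of the engine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (not confirmed by the post): the texture is split into
   blocks of 4 consecutive texels. block_offsets[] stores, per block, the
   offset of that block's 4-texel codeword inside codebook[]. */
static uint8_t vq_fetch(const uint16_t *block_offsets, size_t n_blocks,
                        const uint8_t *codebook, uint32_t texel)
{
    uint32_t block  = texel >> 2;  /* which block: compare lsr.w #2,d0 */
    uint32_t within = texel & 3u;  /* texel inside block: compare and.b #$3,d1 */

    assert(block < n_blocks);
    /* One extra table read per texel buys a much smaller working set. */
    return codebook[block_offsets[block] + within];
}
```

With a two-codeword codebook `{10,11,12,13, 20,21,22,23}` and block offsets `{4,0,4}`, texel 5 falls in block 1 at position 1 and resolves to codebook entry 1. The cost is the second dependent memory read, which is presumably why the post estimates 6 cycles for this loop.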

sp_ Member #2 - Posted: 15 Dec 2008 12:23 - Edited

When you plot two pixels per loop, the instructions pipeline well. Tiled textures will improve cache hits and speed. These inner loops will run fast.

A cacheline is 16 bytes. If you plot 16 pixels per loop and place the interpolation after the last write to fastmem, you will get a lot of free cycles (while the bus is writing). This will only be useful for big polygons, or in a perspective-correct mapper that corrects every 16 pixels. Copyback mode must be enabled in the cache. I have outlined a loop below. I didn't finish it, but I think it's around 5 cycles per pixel.

While the cache is busy writing to fastmem, the 060 CPU can execute many free cycles. On a 50 MHz MC68030, a longword fastmem write can overlap with 12 cycles of other instructions. I guess a 060 can overlap more. A cacheline is 4 longwords, so 12*4 = 48 cycles are free. The final calculation would be around (5*16) 80 - 48 cycles = 32 cycles (2 cycles per pixel).

.loop16
    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d0
    and.l   d6,d0
    add.l   a0,d0
    lsr.l   d7,d0

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d1
    and.l   d6,d1
    add.l   a0,d1
    lsr.l   d7,d1

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d2
    and.l   d6,d2
    add.l   a0,d2
    lsr.l   d7,d2

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d3
    and.l   d6,d3
    add.l   a0,d3
    lsr.l   d7,d3

    swap    d0
    swap    d2
    move.w  d1,d0
    move.w  d3,d2

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d1
    and.l   d6,d1
    add.l   a0,d1
    lsr.l   d7,d1

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d3
    and.l   d6,d3
    add.l   a0,d3
    lsr.l   d7,d3

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d4
    and.l   d6,d4
    add.l   a0,d4
    lsr.l   d7,d4

    add.l   a2,a0
    add.l   a3,a1
    move.l  a1,d5
    and.l   d6,d5
    add.l   a0,d5
    lsr.l   d7,d5

    swap    d1
    swap    d3
    move.w  d4,d1
    move.w  d5,d3

    (...) 8 more interpolations

    move.b  (a0,d0.w),(a1)+
    swap    d0
    move.b  (a0,d1.w),(a1)+
    swap    d1
    move.b  (a0,d2.w),(a1)+
    swap    d2
    move.b  (a0,d3.w),(a1)+
    swap    d3
    move.b  (a0,d0.w),(a1)+
    move.b  (a0,d1.w),(a1)+
    move.b  (a0,d2.w),(a1)+
    move.b  (a0,d3.w),(a1)+

    (...) 8 more writes
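The cycle estimate in post #2 can be written out as a small C helper. This is only a sketch of the arithmetic as posted: the 5 cycles per pixel and the 12 overlapped cycles per longword are the poster's estimates, and the later measurements in this thread do not confirm that many free cycles on the 060. The function name `net_cycles_per_pixel` is hypothetical.

```c
#include <assert.h>

/* Net cycles per pixel when part of the work hides behind the bus write.
   All inputs are estimates from the post, not measured values. */
static double net_cycles_per_pixel(int pixels, int cycles_per_pixel,
                                   int longwords_per_line,
                                   int free_cycles_per_longword)
{
    assert(pixels > 0);

    int total  = pixels * cycles_per_pixel;                     /* 16 * 5 = 80 */
    int hidden = longwords_per_line * free_cycles_per_longword; /* 4 * 12 = 48 */

    if (hidden > total)  /* cannot hide more work than there is */
        hidden = total;

    return (double)(total - hidden) / pixels;
}
```

Calling `net_cycles_per_pixel(16, 5, 4, 12)` reproduces the figure in the post: 80 - 48 = 32 cycles for 16 pixels, or 2 cycles per pixel.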

_Jamie_ Member #3 - Posted: 15 Dec 2008 13:27

    move.l  a1,d0
    and.l   d6,d0
    add.l   a0,d0
    lsr.l   d7,d0

Here you have the same destination register four times in a row, so the instructions cannot pair. You need to interleave the destination registers if you want them to pair.

Also, move.b (a0,d0.w),(a1)+ is slower than move.b (a0,d0.l),(a1)+ on the 060 (no idea why).

_Jamie_ Member #4 - Posted: 15 Dec 2008 14:09

I tried executing some instructions after a 16-byte write, and it doesn't seem to work. Fuck, it was a good idea.

sp_ Member #5 - Posted: 15 Dec 2008 14:51 - Edited

Yes, I was too quick when making this loop. With pairing it will use fewer cycles.

a1 must be aligned to a 16-byte boundary, and copyback mode must be enabled in the data cache. It should work (in theory)... but...

_Jamie_ Member #6 - Posted: 15 Dec 2008 15:01

It was aligned and copyback mode was enabled. Theory and practice are really different :)

sp_ Member #7 - Posted: 15 Dec 2008 15:44 - Edited

OK. If you disable copyback mode, render 4 pixels per loop and push a longword to memory, it might be faster. The code is not paired for optimal performance.

.loop4
    move.l  d0,(a1)+

    add.l   a3,a1
    add.l   a2,a0
    move.l  a1,d0
    and.l   d6,d0
    add.l   a0,d0
    lsr.l   d7,d0

    add.l   a3,a1
    add.l   a2,a0
    move.l  a1,d1
    and.l   d6,d1
    add.l   a0,d1
    lsr.l   d7,d1

    add.l   a3,a1
    add.l   a2,a0
    move.l  a1,d2
    and.l   d6,d2
    add.l   a0,d2
    lsr.l   d7,d2

    add.l   a3,a1
    add.l   a2,a0
    move.l  a1,d3
    and.l   d6,d3
    add.l   a0,d3
    lsr.l   d7,d3

    move.w  (a0,d0.l),d0
    move.w  (a0,d2.l),d2
    move.b  (a0,d1.l),d0
    move.b  (a0,d3.l),d2
    swap    d0
    move.w  d2,d0
    loop

_Jamie_ Member #8 - Posted: 15 Dec 2008 16:15

It's exactly the same result.

_Jamie_ Member #9 - Posted: 15 Dec 2008 16:17

I mean it doesn't work: you get only one free instruction after one write.

sp_ Member #10 - Posted: 15 Dec 2008 18:16 - Edited

I think I tested this on a 060 10 years ago. If a memory write takes 1 cycle, you are writing to the cache and not to memory.

Test the following loop. The adds should be "free":

    lea     fastmem,a0
    moveq.l #0,d0
    move.l  #$ffff,d7
.loop
    move.l  d0,(a0)+
    add.l   d1,d2
    add.l   d3,d4
    add.l   d5,d6
    dbf     d7,.loop

_Jamie_ Member #11 - Posted: 15 Dec 2008 18:30

I can't test right now, but I'm practically sure that you are writing to the cache if your data cache is enabled, so you don't get any free instructions except when you pair your instructions.

sp_ Member #12 - Posted: 16 Dec 2008 01:40 - Edited

The data cache is 8 KB. If your program is writing within an 8 KB block, all writes will take 1 cycle. When you start to write outside the block, the cache will start to push cachelines to memory, and your program will slow down.

If d7 is changed to #8192/4, the first run will fill the cache, and the next run will run at 1 cycle per write since all the memory is already mapped in the cache.

_Jamie_ Member #13 - Posted: 16 Dec 2008 01:50

That makes sense. If I read and the cacheline is not loaded, I have ~7 free cycles, so why doesn't it work when I write?

_Jamie_ Member #14 - Posted: 16 Dec 2008 03:28

OK, after some testing I have more precise results:

    move.l  (a0)+,d0
    move.l  (a0)+,d1
    move.l  (a0)+,d2
    move.l  (a0)+,d3
    tst.b   (a0)+
    rept 10
    clr.l   d4
    clr.l   d5
    endr

    move.l  d0,(a0)+
    move.l  d1,(a0)+
    move.l  d2,(a0)+
    move.l  d3,(a0)+
    rept 2
    clr.l   d4
    clr.l   d5
    endr

So 10 cycles for the read and 2 cycles for the write.

sp_ Member #15 - Posted: 16 Dec 2008 08:56 - Edited

Edit: So this means a 2-cycle delay if the memory is already mapped in the cache, and 10 cycles if not. How about this loop?

    tst.b   (a0)            ; preload cacheline
    rept 10
    clr.l   d4
    clr.l   d5
    endr
    move.l  (a0)+,d0
    move.l  (a0)+,d1
    move.l  (a0)+,d2
    move.l  (a0)+,d3

    tst.b   (a0)
    rept 10
    clr.l   d4
    clr.l   d5
    endr
    move.l  d0,(a0)+
    move.l  d1,(a0)+
    move.l  d2,(a0)+
    move.l  d3,(a0)+

Kalms Member #16 - Posted: 16 Dec 2008 17:37 - Edited

Please first identify which of your read/write instructions will cause actual cacheline reads/writes. Saying that "there are 2 cycles of room after a cacheline write" after you've done four memory writes is a bit inaccurate.

If we take this example code:

    ; assume that a0 is 16-byte aligned
    ; assume that we run this loop many times
    move.l  d0,(a0)+
    move.l  d1,(a0)+
    move.l  d2,(a0)+
    move.l  d3,(a0)+

For each iteration through the loop:

0) If the data cache entry it's writing to is already in the cache, the code will run with a throughput of 1 cycle per instruction. The writes will make that cacheline dirty. Loop iteration done.

Otherwise:

1) If the cacheline it's writing to is NOT in the cache, a new cacheline will be allocated. (This happens during the first write.)
2a) If the allocated cacheline is non-dirty, no cacheline write will happen.
2b) If the allocated cacheline is dirty, it will be moved to the store buffer and a cacheline write is enqueued for later.
3) If the data cache is busy with a previous transfer, it will first stall until that previous transfer has completed. (It will probably be a cacheline write triggered by the previous iteration, of which there are about 18 cycles remaining.)
4) A cacheline read begins. This takes about 4 cycles per longword x 4 longwords = 16 cycles (in practice about 20 cycles on a B1260, with other, unknown, overhead). When this cacheline read is finished, a cacheline write will begin if one was queued up during step 2b. (That will take roughly 20 cycles too.)
5) The first MOVE completes after the first longword of the cacheline read has completed (takes about 5 cycles).
6) The next MOVE attempts to write to the data cache, to a line which is currently being read in. This will stall the execution unit until the cacheline read is finished (another 15 cycles).
7) The next two MOVEs will be done while the data cache is idle (except for the push buffer perhaps servicing a cacheline write in the background). They will complete in 1 cycle each.

So the right place to add padding is between the first and the second MOVE instruction. Or do cache prewarming using a dummy tst.b (a0) a while earlier. You should be able to get roughly 15 cycles of CPU/bus overlap if you do it properly. (Well, at least 10 cycles.)

One thing to notice here is that it is hard to control when writes are being done. Actual cacheline writes are triggered by other reads and writes. The safest way to know is to have a loop which processes much more than 8 kB and does no random access; then you can make pretty good statements about which memory accesses will trigger cacheline writes. Otherwise it is better to go by some general rules of thumb and then measure.
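The prewarming idea from the post above (a dummy tst.b on a line "a while earlier") can be sketched in C as a touch-ahead fill. This is an illustration of the ordering only, under the assumptions that `dst` is 16-byte aligned and that the target has a write-allocate cache with 16-byte lines; it makes no timing claim, and a C compiler is free to schedule the surrounding stores differently from hand-written assembly. The name `fill_with_touch_ahead` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_LONGWORDS = 4 };  /* 16-byte cacheline = 4 longwords */

/* Fill n_lines cachelines of dst with value. Before writing line i, read one
   byte of line i+1 through a volatile pointer so that the next line fill can
   start while line i is still being written. */
static void fill_with_touch_ahead(uint32_t *dst, size_t n_lines, uint32_t value)
{
    assert(dst != NULL);

    for (size_t i = 0; i < n_lines; i++) {
        if (i + 1 < n_lines) {
            const volatile uint8_t *next =
                (const volatile uint8_t *)(dst + (i + 1) * LINE_LONGWORDS);
            (void)*next;  /* dummy read, plays the role of tst.b (a0) */
        }
        for (size_t k = 0; k < LINE_LONGWORDS; k++)
            dst[i * LINE_LONGWORDS + k] = value;
    }
}
```

The volatile qualifier only guarantees that the dummy read is actually performed; whether it buys any overlap depends on the CPU and memory system, so it would still need to be measured as the post recommends.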

_Jamie_ Member #17 - Posted: 16 Dec 2008 18:09

Hi Kalms,

I tested only lines that are not in the cache (with a big buffer).

For the read it seems really constant: if your line is not in the data cache you get 10 free cycles.

    move.l  (a0)+,d0
    move.l  (a0)+,d1
    move.l  (a0)+,d2
    move.l  (a0)+,d3
    tst.b   (a0)

Does this look clean to you? 15 cycles seems like a lot to me.

For the write, with the simple example we used (only 4 writes), I tested every possible placement: between the moves, at the end, at the beginning. I couldn't find a way to get more than 2 cycles.

By the way, for mapping it will be hard to control the free cycles on writes.

Kalms Member #18 - Posted: 16 Dec 2008 21:19

jamie:

For the read, it looks clean. 10 cycles is reasonable.

For the write, what you report sounds strange, but I don't have my machine accessible this month so unfortunately I can't do any tests.

_Jamie_ Member #19 - Posted: 16 Dec 2008 21:31

I hope you are right, it would be cool to have some new free cycles :)

_Jamie_ Member #20 - Posted: 17 Dec 2008 02:08

OK, I did some more tests.

So it was not 10 cycles for the read but 5 cycles (I lost my mind with my kid :)

    rept 4
    move.l  (a0)+,d0
    move.l  (a0)+,d1
    move.l  (a0)+,d2
    move.l  (a0)+,d3
    tst.b   (a0)
      rept 5
      sub.l   d4,d4
      sub.l   d5,d5
      endr
    endr

With the free cycles placed after the first move it's 11 cycles:

    rept 4
    move.l  (a0)+,d0
      rept 11
      sub.l   d4,d4
      sub.l   d5,d5
      endr
    move.l  (a0)+,d1
    move.l  (a0)+,d2
    move.l  (a0)+,d3
    endr

For the writing it's really strange. I have no accurate result and it all seems really weird: the buffer is aligned to 16 bytes and copyback mode is on, but it's always 2 cycles. I have one interrupt running in the background, so maybe I will test without the interrupt.

sp_ Member #21 - Posted: 17 Dec 2008 13:26 - Edited

I have read the post from Kalms. If he is right, the optimal loop would be something like this. Set x=5 and increase it until the loop no longer runs at copy speed.

    ; assume that a0 is 16-byte aligned
    ; assume that we run this loop many times
.loop16
    move.l  (a0)+,d0        ; a new cacheline is fetched
    clr.l   d4
    REPT x
    clr.l   d5
    clr.l   d4
    ENDR
    move.l  (a0)+,d1
    move.l  (a0)+,d2
    move.l  (a0)+,d3

    move.l  d0,(a0)+        ; a new cacheline is fetched
    clr.l   d5
    REPT x
    clr.l   d4
    clr.l   d5
    ENDR
    move.l  d1,(a0)+
    move.l  d2,(a0)+
    move.l  d3,(a0)+

Boogeyman Member #22 - Posted: 17 Dec 2008 13:30

Life was so much simpler on the 68000 :)

Good luck with your new demo, looking forward to seeing it.

sp_ Member #23 - Posted: 17 Dec 2008 15:02 - Edited

The speed test loop above, with linear reads from memory, is more suitable for a c2p loop. For texture mapping with big textures the reads are mostly cache misses. I think you should use a set of inner loops for optimal performance.

The writes are linear in memory, so you will always get free cycles for every 16 bytes written. The reads, however, are difficult to predict. Use all the free cycles from a cacheline write and spread the rest of the interpolation instructions evenly across the 16 reads.

x = total number of interpolation cycles needed
n = free cycles after a cacheline write

For 16 pixels:

.loop16
    write cacheline
    interpolate n cycles
    read byte 1
    interpolate (x-n)/16
    read byte 2
    interpolate (x-n)/16
    read byte 3
    interpolate (x-n)/16
    ...
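The scheduling rule in post #23 can be sketched as a small C helper that works out how many interpolation cycles to place after each of the 16 reads. The parameters `x` and `n` are the ones defined in the post; the handling of the integer remainder (given to the first reads) is an added assumption, as is the function name `spread_interpolation`.

```c
#include <assert.h>

enum { READS_PER_LINE = 16 };  /* 16 texel reads per 16-byte output line */

/* Fill slots[0..15] with the number of interpolation cycles to schedule after
   each texel read. n cycles are assumed to be hidden behind the cacheline
   write, so only x - n cycles remain to be spread. */
static void spread_interpolation(int x, int n, int slots[READS_PER_LINE])
{
    assert(x >= 0 && n >= 0);

    int rest  = x > n ? x - n : 0;
    int base  = rest / READS_PER_LINE;
    int extra = rest % READS_PER_LINE;  /* remainder goes to the first reads */

    for (int i = 0; i < READS_PER_LINE; i++)
        slots[i] = base + (i < extra ? 1 : 0);
}
```

With the figures from post #2 (x = 80, n = 48) every slot receives 2 cycles. How large n really is on a given machine is exactly what the timing tests in this thread are trying to establish.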

_Jamie_ Member #24 - Posted: 17 Dec 2008 15:31

Actually I did practically the same trick for my backface culling:

    movem.w (a0)+,d0-d3
    tst.b   (a0)
    backface calculation

For the mapping inner loop I will not use these free cycles, because I correct every 8 pixels and I have implemented a working version of the multisize swizzled tile mapping.

Now I dream of implementing the compressed textures. The inner loop is easy to do, but finding the best codebook for a group of textures is a real challenge.
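For readers trying to picture the swizzled layout, the register comments in post #1 (`u00uffff` and `0vvfffff`) suggest that, once the two masked values are OR-ed and shifted down, the texel index is made of the high bits of u, then v, then the low bits of u. The C sketch below models that reading. The strip interpretation, the bit widths and the name `strip_index` are assumptions drawn from those comments, not a confirmed description of the engine.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the texture is stored as vertical strips, each
   (1 << strip_bits) texels wide and (1 << v_bits) rows tall, so that texels
   that are close in u and v are also close in memory. */
static uint32_t strip_index(uint32_t u, uint32_t v,
                            unsigned strip_bits, unsigned v_bits)
{
    assert(strip_bits < 16 && v_bits < 16);

    uint32_t u_lo   = u & ((1u << strip_bits) - 1u);  /* column inside strip */
    uint32_t u_hi   = u >> strip_bits;                /* which strip */
    uint32_t v_wrap = v & ((1u << v_bits) - 1u);      /* row, wrapped */

    return (u_hi << (v_bits + strip_bits)) | (v_wrap << strip_bits) | u_lo;
}
```

With 16-texel strips (`strip_bits = 4`), stepping one row down moves the index by exactly 16 bytes, one cacheline, instead of a full texture row, which would fit the remark that this layout is kinder to the cache.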

_Jamie_ Member #25 - Posted: 17 Dec 2008 15:42

The story behind all this optimisation began when I started converting the demo to the Amiga (yes, I code the demo on PC first). I gave texture limitations to my artists, and they used a lot of 128*64 or 64*128 textures. I had one of the first unit tests (a rotozoom) running on the Amiga, and I could see that the cache has conflicts when I write the pixels. Fuck, I have 3 scenes with this texture limitation. For now the only solution I have found for better cache use is to compress the textures.

sp_ Member #26 - Posted: 19 Dec 2008 15:10 - Edited

I hope you make a new demo, Jamie! "When we ride on our enemies" was pretty good, but now with 10 more years of experience I think you can do better. I like the overscan mode used. I think you used it to get the c2p to run in 0 DMA?

I am working on an Amiga 500 demo, but progress is slow. Maybe I will release it at Breakpoint 2009. (I said that last year too, so don't count on it.) ;)

_Jamie_ Member #27 - Posted: 19 Dec 2008 17:42

In the party release it's a classic c2p with slow DMA :) I released a better version just one month later, but it seems no one has that version.

Yep, demos take a lot of time, and with a job and a family it's really hard to finish something. Hope to see you at Breakpoint with a crazy A500 demo.

_Jamie_ Member #28 - Posted: 21 Dec 2008 16:24

In the end I am using VQ compression, which was used on the Dreamcast. I get a ratio of 1/2 to 1/3. It will be interesting to see how it works with a complete scene.
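To get a feel for where a ratio in that range could come from, here is a small C sketch that adds up the storage for a VQ texture: one index per block plus the codebook itself. Every parameter (block size, index width, codebook size) is a hypothetical example value, since the thread does not state the actual ones, and the function name `vq_bytes` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a VQ-compressed 8-bit texture: an index table with one
   entry per block, plus a codebook of codewords * block_texels bytes. */
static size_t vq_bytes(size_t texels, size_t block_texels,
                       size_t codewords, size_t index_bytes)
{
    assert(block_texels > 0 && texels % block_texels == 0);

    size_t index_table = (texels / block_texels) * index_bytes;
    size_t codebook    = codewords * block_texels;
    return index_table + codebook;
}
```

For a 256x256 texture (65536 bytes uncompressed) with 4-texel blocks, 2-byte indices and 1024 codewords, this gives 36864 bytes, a little over 1/2. With 1-byte indices and 256 codewords it gives 17408 bytes, a little over 1/4. Sharing one codebook across a group of textures spreads the codebook cost further.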
