A.D.A. Amiga Demoscene Archive

Amiga Demoscene Archive Forum / Coding / Fastest Memcopy and Memclear on 060

Author	Message
sp_ Member	#1 - Posted: 23 May 2011 20:06
What is the fastest way to clear a buffer on the 060? And what is the fastest way to copy data? move16? fmove? movem.l? Cache prefetching with tst.w 31(a0)? Please show me the optimal loops.
sp_ Member	#2 - Posted: 23 May 2011 20:57 - Edited
Here is an untested copy loop that uses the FPU. Assume a0 and a1 point to two buffers aligned to 16 bytes.
        fmove.d (a0)+,fp0
        fmove.d fp0,(a1)+       ;align memory by 8
        bra.b   .inn
.loop   fmove.d fp0,(a1)+       ;cached
        fmove.d fp1,(a1)+       ;cached
        fmove.d fp2,(a1)+       ;cached
        fmove.d fp3,(a1)+       ;cached
.inn    tst.w   23(a1)          ;fetch 2 cachelines (write) (((a1+8) mod 32)-1)
        subq.l  #1,d7           ;free
        fmove.d (a0)+,fp0       ;cached
        fmove.d (a0)+,fp1       ;fetch cacheline
        fmove.d (a0)+,fp2       ;cached
        fmove.d (a0)+,fp3       ;fetch cacheline
        bne.b   .loop           ;free
sp_ Member	#3 - Posted: 23 May 2011 22:21
Here is an untested integer version:
        move.l  (a0)+,(a1)+     ;align (a0 and a1 mod 16 + 4)
        bra.b   .inn
.loop   move.l  d0,(a1)+        ;cached
        move.l  d1,(a1)+        ;cached
        move.l  d2,(a1)+        ;cached
        move.l  a3,(a1)+        ;cached
.inn    tst.w   12(a1)          ;fetch next cacheline
        move.l  (a0)+,d0        ;cached
        move.l  (a0)+,d1        ;cached
        move.l  (a0)+,d2        ;cached
        subq.l  #1,d7           ;free
        move.l  (a0)+,a3        ;fetch cacheline (a move to an address register leaves the flags from subq intact)
        bne.b   .loop           ;free
jamie2010 Member	#4 - Posted: 24 May 2011 01:31
move16 will be faster if you want to copy a big chunk of memory.
sp_ Member	#5 - Posted: 24 May 2011 09:29 - Edited
What about memclear? Which loop is the fastest?
.loop   fmove.d fp1,(a1)+
        fmove.d fp2,(a1)+
        fmove.d fp3,(a1)+
        fmove.d fp4,(a1)+
        subq.l  #1,d7
        bne.b   .loop
or
.loop   move16  (a0),(a1)+
        move16  (a0),(a1)+
        subq.l  #1,d7
        bne.b   .loop
or
.loop   tst.w   47(a1)
        fmove.d fp1,(a1)+
        fmove.d fp2,(a1)+
        fmove.d fp3,(a1)+
        fmove.d fp4,(a1)+
        subq.l  #1,d7
        bne.b   .loop
or
.loop   tst.w   47(a1)
        move16  (a0),(a1)+
        move16  (a0),(a1)+
        subq.l  #1,d7
        bne.b   .loop
dalton Member	#6 - Posted: 24 May 2011 13:07
I don't think move16 (a0),(a1)+ is a valid instruction... I'd write something like
        move.l  a0,a1
        clr.l   (a1)+
        clr.l   (a1)+
        clr.l   (a1)+
        clr.l   (a1)+
.loop   move16  (a0)+,(a1)+
        subq.l  #1,d0
        bgt.b   .loop
move16 should be faster than fmove.d, as move16 bypasses the cache; there was a thread about it here not long ago.
Blueberry Member	#7 - Posted: 26 May 2011 13:26
For details on MOVE16 and other memory access, see this thread. In summary, the key to copy/clear performance is to avoid reading the destination (i.e. to only write it). The only instruction which accomplishes this is MOVE16. Thus:
- If source and destination are aligned identically (relative to 16-byte boundaries), use MOVE16 from source to destination.
- If source and destination are aligned differently, use MOVEs to copy 16 bytes from the unaligned source into a fixed, aligned cache line, followed by MOVE16 from this fixed line to the aligned destination.
- For clearing, clear a fixed, aligned cache line and copy it repeatedly to the destination using MOVE16.
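The data flow of the second case in Blueberry's summary can be modelled in C. This is only a sketch of which bytes move where: `staged_copy` and its parameters are hypothetical names, `memcpy` stands in for both the plain MOVEs and the MOVE16, and nothing here says anything about 68060 timing. It assumes the destination is already 16-byte aligned and the length is a whole number of lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 16u  /* cache line size in bytes */

/* Hypothetical helper: copy `lines` 16-byte lines from `src` (any alignment)
   to `dst` (must be 16-byte aligned). Models data flow only. */
static void staged_copy(uint8_t *dst, const uint8_t *src, size_t lines)
{
    uint8_t raw[2 * LINE];
    /* Carve a 16-byte aligned staging line out of `raw`: the "fixed line". */
    uint8_t *staging = (uint8_t *)(((uintptr_t)raw + (LINE - 1u)) & ~(uintptr_t)(LINE - 1u));
    size_t i;

    assert(((uintptr_t)dst % LINE) == 0);

    if (((uintptr_t)src % LINE) == 0) {
        /* Identical alignment: line-to-line copy (MOVE16 source -> destination). */
        for (i = 0; i < lines; i++)
            memcpy(dst + i * LINE, src + i * LINE, LINE);
    } else {
        for (i = 0; i < lines; i++) {
            memcpy(staging, src + i * LINE, LINE);  /* plain MOVEs into the fixed line */
            memcpy(dst + i * LINE, staging, LINE);  /* MOVE16 fixed line -> destination */
        }
    }
}
```

On real hardware the point of the staging line is that it stays in the data cache, so only the source is read from memory and only the destination is written.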
dalton Member	#8 - Posted: 27 May 2011 11:49
So my suggestion was not really good, because it will read through the entire buffer. Better to do it like this (although it bothers me that it's not relocatable):
.loop   move16  #zeros,(a0)+
        subq.l  #1,d0
        bgt.b   .loop
        align   0,16
zeros   dc.l    0,0,0,0
Blueberry Member	#9 - Posted: 30 May 2011 14:17 - Edited
Yes, except:
- The addressing mode would be zeros,(a0)+ rather than #zeros,(a0)+. Just a typo, I presume.
- 16-byte aligning a label in an executable does not guarantee 16-byte alignment in memory after loading, since the loader only guarantees 8-byte alignment. This can be fixed by putting some zeros before the zeros label as well, since MOVE16 just rounds the address down.
- The source line must be read into the cache explicitly before the loop. MOVE16 does not put the source line in the cache if it is not already there.
To make the code relocatable, we could have
        lea     zeros(pc),a1
        tst.l   (a1)            ; load line into cache
.loop:  move16  (a1)+,(a0)+
        lea     -16(a1),a1
        subq.l  #1,d0
        bgt.b   .loop
        align   0,8
        dc.l    0,0
zeros:  dc.l    0,0,0,0
It will be just as fast, since the LEA pairs with the SUB (and the last two cycles of MOVE16 from cached memory can overlap with other instructions anyway).
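Why eight bytes of padding in front of the zeros label are enough can be checked with a little address arithmetic. The C sketch below is a model under the assumptions stated in the post (MOVE16 rounds the address down to a 16-byte line, the loader guarantees 8-byte alignment); `line_base` and `line_is_all_zeros` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

#define LINE 16u  /* cache line size in bytes */

/* Address of the 16-byte line that contains `addr`
   (models MOVE16 rounding the address down). */
static uint32_t line_base(uint32_t addr)
{
    return addr & ~(LINE - 1u);
}

/* Returns 1 if the line read for `label` lies entirely inside the
   zero-filled region [label - pad, label + 16), else 0.
   `pad` is the number of zero bytes placed before the label. */
static int line_is_all_zeros(uint32_t label, uint32_t pad)
{
    /* The line starts at or below the label, so it always ends inside
       label + 16; only its start can fall outside the zeros. */
    return (label - line_base(label)) <= pad;
}
```

With 8-byte alignment the label sits at offset 0 or 8 within its line, so the line never starts more than 8 bytes before the label, and the `dc.l 0,0` in front of it covers that case. Without the padding, a label at offset 8 would pull in 8 bytes of whatever precedes it.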
dalton Member	#10 - Posted: 30 May 2011 17:29
that's nice! also, the fact that move16 needs an explicit cache poke in advance could explain some problems I've been having =)
Cosmos Amiga Member	#11 - Posted: 15 Aug 2019 12:20 - Edited
I have to copy a 320x200 screen from fastmem to fastmem on the 68060, and I cannot use move16! Do I need to fill the 8KB data cache with tst.w and then do some move.l (a0)+,(a1)+, and repeat that 8 times? Or should the tst.w go inside the .loop? What is the fastest routine to copy these 64KB of data? What do you think about this?
        move.w  #$0FA0,d2       ; 320 * 200 = 64000 / 16 = 4000 = $FA0
.loop_cache
        tst.w   0*16+6(a0)
        tst.w   1*16+6(a0)
        tst.w   2*16+6(a0)
        tst.w   3*16+6(a0)      ; preload 4 dcache lines
        moveq   #4,d1           ; copy 1 dcache line per pass, 4 passes
.loop   move.l  (a0)+,(a1)+     ; 4 bytes
        move.l  (a0)+,(a1)+     ; 4 bytes
        move.l  (a0)+,(a1)+     ; 4 bytes
        move.l  (a0)+,(a1)+     ; 4 bytes x 4 = 16 bytes = 1 dcache line
        subq.b  #1,d1
        bne.b   .loop           ; *4
        subq.w  #4,d2
        bne.b   .loop_cache     ; go preloading 4 new dcache lines
Thanks,
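The counter arithmetic in that loop can be sanity-checked with a small C model. This is a sketch of the loop structure only, assuming the counts from the post (d2 starts at $FA0 lines, four lines per outer pass, 16 bytes per inner pass); `simulate_copy_loop` is a hypothetical name and the model says nothing about speed.

```c
/* Models the two nested counters of the posted loop and returns the
   number of bytes the move.l instructions would copy.
   `d2` is the initial line count; it must be a non-zero multiple of 4,
   otherwise the real loop would not terminate where intended. */
static unsigned long simulate_copy_loop(unsigned d2)
{
    unsigned long bytes = 0;

    do {
        unsigned d1 = 4;            /* moveq #4,d1 */
        do {
            bytes += 4 * 4;         /* four move.l (a0)+,(a1)+ = 16 bytes */
        } while (--d1 != 0);        /* subq.b #1,d1 / bne.b .loop */
        d2 -= 4;                    /* subq.w #4,d2 */
    } while (d2 != 0);              /* bne.b .loop_cache */

    return bytes;
}
```

Starting from $FA0 the outer loop runs 1000 times and the model accounts for exactly 320 * 200 = 64000 bytes, so the counters are consistent with the buffer size.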
Blueberry Member	#12 - Posted: 15 Aug 2019 16:34 - Edited
Read my post in this thread from 26 May 2011. Why is it that you can't use move16?
Cosmos Amiga Member	#13 - Posted: 15 Aug 2019 16:47 - Edited
Because my a0 and a1 come from AllocMem or AllocVec inside P96/CybergraphX, so they are not aligned to 16-byte boundaries, which is a big problem in Kickstart 3.9... I checked: AllocMem always returns a fastram pointer aligned to an 8-byte boundary, and AllocVec one aligned to a 4-byte boundary...
"If source and destination are aligned differently, use MOVEs to copy 16 bytes from the unaligned source into a fixed, aligned cache line, followed by MOVE16 from this fixed line to the aligned destination."
Yes, it's a good idea. I need some code to get them aligned, I'm going to think about that...
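One way to think about "getting them aligned" when the destination pointer cannot be changed is to split the copy into an unaligned head, a run of whole 16-byte lines, and a tail. The C sketch below only computes that split; `copy_plan` and `plan_copy` are hypothetical names, and how the head and tail are actually copied (for example with plain move.l or move.b) is left open.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte counts for the three phases of a copy to an unaligned destination. */
typedef struct {
    size_t head;   /* bytes before the first 16-byte boundary of dst */
    size_t lines;  /* whole 16-byte lines (candidates for MOVE16)    */
    size_t tail;   /* leftover bytes after the last whole line       */
} copy_plan;

/* Hypothetical helper: split `len` bytes written at address `dst`. */
static copy_plan plan_copy(uintptr_t dst, size_t len)
{
    copy_plan p;
    size_t head = (size_t)((16u - (dst & 15u)) & 15u);

    if (head > len)
        head = len;                 /* buffer ends before the boundary */

    p.head  = head;
    p.lines = (len - head) / 16u;
    p.tail  = (len - head) % 16u;
    return p;
}
```

For an 8-byte aligned AllocMem pointer and a 64000-byte screen this gives a head of 8 bytes, 3999 whole lines and a tail of 8 bytes, so almost all of the buffer can still go through the aligned path.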
todi Member	#14 - Posted: 15 Aug 2019 19:00 - Edited
Couldn't you just align the memory you are allocating? For example:
        ...
        add.l   #16,d0
        jsr     _LVOAllocMem(a6)
        tst.l   d0
        beq.s   .error
        move.l  d0,memory_used_for_freemem
        add.l   #16-1,d0
        and.l   #$fffffff0,d0
        move.l  d0,memory_used_for_routine
        ...
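The add/and pair in todi's snippet is the usual round-up-to-a-power-of-two idiom. A minimal C model of the arithmetic, assuming 32-bit addresses as on the Amiga (`align_up_16` and `padded_size` are hypothetical names):

```c
#include <stdint.h>

/* Round `addr` up to the next multiple of 16
   (add.l #16-1,d0 / and.l #$fffffff0,d0). */
static uint32_t align_up_16(uint32_t addr)
{
    return (addr + 15u) & 0xFFFFFFF0u;
}

/* Size to request so that `size` usable bytes remain after aligning
   (add.l #16,d0 before the AllocMem call). */
static uint32_t padded_size(uint32_t size)
{
    return size + 16u;
}
```

Rounding up skips at most 15 bytes, so the 16 extra bytes requested are always enough. The original pointer and the padded size still have to be kept for the matching FreeMem call, which is what `memory_used_for_freemem` is for in the snippet.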
Cosmos Amiga Member	#15 - Posted: 15 Aug 2019 19:05
@todi The memory allocations are made inside P96 or CybergraphX...
todi Member	#16 - Posted: 15 Aug 2019 19:17
Ahh, ok. For CybergraphX, just use AllocMem, align it, and then make your own BitMap struct with the aligned memory.
