A.D.A. Amiga Demoscene Archive

Amiga Demoscene Archive Forum / Coding / Fastest Memcopy and Memclear on 060

 

sp_
Member
#1 - Posted: 23 May 2011 20:06
What is the fastest way to clear a buffer on 060? And what is the fastest way to copy data?

move16 ?
fmove?
movem.l?

tst.w 31(a0) cache prefetching?

Please show me the optimal loops.
sp_
Member
#2 - Posted: 23 May 2011 20:57 - Edited
Here is an untested copy loop that uses the FPU.

Assume a0 and a1 point to two buffers aligned to 16 bytes.

fmove.d (a0)+,fp0
fmove.d fp0,(a1)+ ;offset by 8: a0 and a1 are now 8 mod 16
bra.b .inn
.loop
fmove.d fp0,(a1)+ ;cached
fmove.d fp1,(a1)+ ;cached
fmove.d fp2,(a1)+ ;cached
fmove.d fp3,(a1)+ ;cached
.inn
tst.w 23(a1) ;fetch 2 cachelines for writing (a1+23 is 15 mod 16, so the word straddles two lines)
subq.l #1,d7 ;free
fmove.d (a0)+,fp0 ;cached
fmove.d (a0)+,fp1 ;fetch cacheline
fmove.d (a0)+,fp2 ;cached
fmove.d (a0)+,fp3 ;fetch cacheline
bne.b .loop ;free
fmove.d fp0,(a1)+ ;drain: store the last 32 bytes loaded
fmove.d fp1,(a1)+
fmove.d fp2,(a1)+
fmove.d fp3,(a1)+
sp_
Member
#3 - Posted: 23 May 2011 22:21
Here is an untested integer version:

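move.l (a0)+,(a1)+ ;offset by 4: a0 and a1 are now 4 mod 16
bra.b .inn
.loop
move.l d0,(a1)+ ;cached
move.l d1,(a1)+ ;cached
move.l d2,(a1)+ ;cached
move.l a3,(a1)+ ;cached
.inn
tst.w 12(a1) ;fetch next cacheline

move.l (a0)+,d0 ;cached
move.l (a0)+,d1 ;cached
move.l (a0)+,d2 ;cached
subq.l #1,d7 ;free
move.l (a0)+,a3 ;fetch cacheline (a move to an address register leaves the flags from subq intact)
bne.b .loop ;free
move.l d0,(a1)+ ;drain: store the last 16 bytes loaded
move.l d1,(a1)+
move.l d2,(a1)+
move.l a3,(a1)+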
jamie2010
Member
#4 - Posted: 24 May 2011 01:31
move16 will be faster if you want to copy a big chunk of memory.
sp_
Member
#5 - Posted: 24 May 2011 09:29 - Edited
What about memclear? Which loop is the fastest?

.loop
fmove.d fp1,(a1)+
fmove.d fp2,(a1)+
fmove.d fp3,(a1)+
fmove.d fp4,(a1)+
subq.l #1,d7
bne.b .loop

or

.loop
move16 (a0),(a1)+
move16 (a0),(a1)+
subq.l #1,d7
bne.b .loop

or

.loop
tst.w 47(a1)
fmove.d fp1,(a1)+
fmove.d fp2,(a1)+
fmove.d fp3,(a1)+
fmove.d fp4,(a1)+
subq.l #1,d7
bne.b .loop

or

.loop
tst.w 47(a1)
move16 (a0),(a1)+
move16 (a0),(a1)+
subq.l #1,d7
bne.b .loop
dalton
Member
#6 - Posted: 24 May 2011 13:07
I don't think move16 (a0),(a1)+ is a valid instruction... I'd write something like


move.l a0,a1
clr.l (a1)+
clr.l (a1)+
clr.l (a1)+
clr.l (a1)+
.loop
move16 (a0)+,(a1)+
subq.l #1,d0
bgt.b .loop


move16 should be faster than fmove.d, as move16 bypasses the cache. There was a thread about it here not long ago.
Blueberry
Member
#7 - Posted: 26 May 2011 13:26
For details on MOVE16 and other memory access, see this thread.

In summary, the key to copy/clear performance is to avoid reading (i.e. only writing) the destination. The only instruction which accomplishes this is MOVE16. Thus:

If source and destination are aligned identically (relative to 16-byte boundaries), use MOVE16 from source to destination.

If source and destination are aligned differently, use MOVEs to copy 16 bytes from the unaligned source into a fixed, aligned cache line, followed by MOVE16 from this fixed line to the aligned destination.

For clearing, clear a fixed, aligned cache line and copy it repeatedly to the destination using MOVE16.
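
The differently-aligned case above could be sketched roughly as follows. This is an untested illustration, not code from the thread: the register assignment (a0 = source, a1 = 16-byte aligned destination, d0 = number of 16-byte blocks, greater than zero) and the scratch label are assumptions made for the example. The staging line is aligned at run time rather than relying on the alignment of the label.

```
; Assumed inputs (hypothetical, chosen for this sketch):
;   a0 = source, any alignment
;   a1 = destination, 16-byte aligned
;   d0 = number of 16-byte blocks to copy, > 0
        lea     scratch+15(pc),a2
        move.l  a2,d1
        and.w   #$fff0,d1       ; round down to a 16-byte boundary inside scratch
        move.l  d1,a2           ; a2 = fixed, aligned staging line
.loop:
        move.l  (a0)+,(a2)      ; gather 16 bytes from the unaligned source
        move.l  (a0)+,4(a2)
        move.l  (a0)+,8(a2)
        move.l  (a0)+,12(a2)
        move16  (a2)+,(a1)+     ; write a full line without reading the destination
        lea     -16(a2),a2      ; undo the postincrement on the staging pointer
        subq.l  #1,d0
        bgt.b   .loop
        rts

scratch:
        ds.b    32              ; spare bytes, enough to hold one aligned 16-byte line
```

The MOVEs into the staging line keep it in the data cache, so MOVE16 always finds its source there. When source and destination share the same alignment, the staging step drops out and the loop body reduces to a single move16 (a0)+,(a1)+.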
dalton
Member
#8 - Posted: 27 May 2011 11:49
So my suggestion was not really good because it will read through the entire buffer. Better to do it like this (although it bothers me that it's not relocatable):


.loop
move16 #zeros,(a0)+
subq.l #1,d0
bgt.b .loop

align 0,16
zeros
dc.l 0,0,0,0
Blueberry
Member
#9 - Posted: 30 May 2011 14:17 - Edited
Yes, except:
- The addressing mode would be zeros,(a0)+ rather than #zeros,(a0)+. Just a typo, I presume.
- 16-byte aligning a label in an executable does not guarantee 16-byte alignment in memory after loading, since the loader only guarantees 8-byte alignment. This can be fixed by putting some zeros before the zeros label as well, since MOVE16 just rounds the address down.
- The source line must be read into the cache explicitly before the loop. MOVE16 does not put the source line in the cache if it is not already there.

To make the code relocatable, we could have
lea zeros(pc),a1
tst.l (a1) ; load line into cache
.loop:
move16 (a1)+,(a0)+
lea -16(a1),a1
subq.l #1,d0
bgt.b .loop

align 0,8
dc.l 0,0
zeros:
dc.l 0,0,0,0

It will be just as fast, since the LEA pairs with the SUB (and the last two cycles of MOVE16 from cached memory can overlap with other instructions anyway).
dalton
Member
#10 - Posted: 30 May 2011 17:29
that's nice!

also, the fact that move16 needs an explicit cache poke in advance could explain some problems I've been having =)

 
