A.D.A. Amiga Demoscene Archive

Fastest Memcopy and Memclear on 060

 

sp_
Member
#1 - Posted: 23 May 2011 20:06
What is the fastest way to clear a buffer on 060? And what is the fastest way to copy data?

move16 ?
fmove?
movem.l?

tst.w 31(a0) cache prefetching?

Please show me the optimal loops.
sp_
Member
#2 - Posted: 23 May 2011 20:57 - Edited
Here is an untested copy loop that uses the FPU.

Assume a0 and a1 point to two buffers aligned to 16 bytes.

fmove.d (a0)+,fp0
fmove.d fp0,(a1)+ ;align memory by 8
bra.b .inn
.loop
fmove.d fp0,(a1)+ ;cached
fmove.d fp1,(a1)+ ;cached
fmove.d fp2,(a1)+ ;cached
fmove.d fp3,(a1)+ ;cached
.inn
tst.w 23(a1) ;fetch 2 cachelines (write) (((a1+8) mod 32)-1))
subq.l #1,d7 ;free
fmove.d (a0)+,fp0 ;cached
fmove.d (a0)+,fp1 ;fetch cacheline
fmove.d (a0)+,fp2 ;cached
fmove.d (a0)+,fp3 ;fetch cacheline
bne.b .loop ;free
sp_
Member
#3 - Posted: 23 May 2011 22:21
Here is an untested integer version:

move.l (a0)+,(a1)+ ;align (a0 and a1 mod 16 + 4)
bra.b .inn
.loop
move.l d0,(a1)+ ;cached
move.l d1,(a1)+ ;cached
move.l d2,(a1)+ ;cached
move.l a3,(a1)+ ;cached
.inn
tst.w 12(a1) ;fetch next cacheline

move.l (a0)+,d0 ;cached
move.l (a0)+,d1 ;cached
move.l (a0)+,d2 ;cached
subq.l #1,d7 ;free
move.l (a0)+,a3 ;fetch cacheline
bne.b .loop ;free
jamie2010
Member
#4 - Posted: 24 May 2011 01:31
move16 will be faster if you want to copy a big chunk of memory.
sp_
Member
#5 - Posted: 24 May 2011 09:29 - Edited
What about memclear? Which loop is the fastest?

.loop
fmove.d fp1,(a1)+
fmove.d fp2,(a1)+
fmove.d fp3,(a1)+
fmove.d fp4,(a1)+
subq.l #1,d7
bne.b .loop

or

.loop
move16 (a0),(a1)+
move16 (a0),(a1)+
subq.l #1,d7
bne.b .loop

or

.loop
tst.w 47(a1)
fmove.d fp1,(a1)+
fmove.d fp2,(a1)+
fmove.d fp3,(a1)+
fmove.d fp4,(a1)+
subq.l #1,d7
bne.b .loop

or

.loop
tst.w 47(a1)
move16 (a0),(a1)+
move16 (a0),(a1)+
subq.l #1,d7
bne.b .loop
dalton
Member
#6 - Posted: 24 May 2011 13:07
I don't think move16 (a0),(a1)+ is a valid instruction... I'd write something like


move.l a0,a1
clr.l (a1)+
clr.l (a1)+
clr.l (a1)+
clr.l (a1)+
.loop
move16 (a0)+,(a1)+
subq.l #1,d0
bgt.b .loop


move16 should be faster than fmove.d because move16 bypasses the cache; there was a thread about it here not long ago.
Blueberry
Member
#7 - Posted: 26 May 2011 13:26
For details on MOVE16 and other memory access, see this thread.

In summary, the key to copy/clear performance is to avoid reading the destination, i.e. to only write it. The only instruction which accomplishes this is MOVE16. Thus:

If source and destination are aligned identically (relative to 16-byte boundaries), use MOVE16 from source to destination.

If source and destination are aligned differently, use MOVEs to copy 16 bytes from the unaligned source into a fixed, aligned cache line, followed by MOVE16 from this fixed line to the aligned destination.

For clearing, clear a fixed, aligned cache line and copy it repeatedly to the destination using MOVE16.
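
The three cases above can be modelled in C to check the address logic. This is only a sketch, not 68060 code: `move16_model` is a hypothetical stand-in that copies one 16-byte line between addresses rounded down to 16 bytes, as MOVE16 does, and `line`, `copy_via_line` and `clear_via_line` are names made up for the illustration. It assumes the destination is 16-byte aligned and says nothing about cache behaviour or timing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of MOVE16: copies one 16-byte line. Both addresses
   are rounded down to a 16-byte boundary, as the instruction does. */
static void move16_model(const void *src, void *dst)
{
    const void *s = (const void *)((uintptr_t)src & ~(uintptr_t)15);
    void *d = (void *)((uintptr_t)dst & ~(uintptr_t)15);
    memcpy(d, s, 16);
}

/* Fixed, 16-byte aligned staging line (assumed name). */
static _Alignas(16) unsigned char line[16];

/* Copy lines * 16 bytes to a 16-byte aligned destination.
   Same alignment: MOVE16 straight from source to destination.
   Different alignment: plain moves into the staging line, then MOVE16. */
static void copy_via_line(const unsigned char *src, unsigned char *dst,
                          size_t lines)
{
    int same = (((uintptr_t)src ^ (uintptr_t)dst) & 15) == 0;
    for (size_t i = 0; i < lines; i++) {
        if (same) {
            move16_model(src, dst);
        } else {
            memcpy(line, src, 16);   /* ordinary MOVEs on the real CPU */
            move16_model(line, dst);
        }
        src += 16;
        dst += 16;
    }
}

/* Clear: zero the staging line once, then MOVE16 it repeatedly. */
static void clear_via_line(unsigned char *dst, size_t lines)
{
    memset(line, 0, 16);
    for (size_t i = 0; i < lines; i++, dst += 16)
        move16_model(line, dst);
}
```

In the model the destination is never read, only written, which is the property the summary is after.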
dalton
Member
#8 - Posted: 27 May 2011 11:49
So my suggestion was not really good, because it will read through the entire buffer. Better to do it like this (although it bothers me that it's not relocatable):


.loop
move16 #zeros,(a0)+
subq.l #1,d0
bgt.b .loop

align 0,16
zeros
dc.l 0,0,0,0
Blueberry
Member
#9 - Posted: 30 May 2011 14:17 - Edited
Yes, except:
- The addressing mode would be zeros,(a0)+ rather than #zeros,(a0)+. Just a typo, I presume.
- 16-byte aligning a label in an executable does not guarantee 16-byte alignment in memory after loading, since the loader only guarantees 8-byte alignment. This can be fixed by putting some zeros before the zeros label as well, since MOVE16 just rounds the address down.
- The source line must be read into the cache explicitly before the loop. MOVE16 does not put the source line in the cache if it is not already there.

To make the code relocatable, we could have
lea zeros(pc),a1
tst.l (a1) ; load line into cache
.loop:
move16 (a1)+,(a0)+
lea -16(a1),a1
subq.l #1,d0
bgt.b .loop

align 0,8
dc.l 0,0
zeros:
dc.l 0,0,0,0

It will be just as fast, since the LEA pairs with the SUB (and the last two cycles of MOVE16 from cached memory can overlap with other instructions anyway).
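
The padding argument can be checked with a few lines of C. This is a sketch of the address arithmetic only; `line_is_all_zeros` and the example addresses are made up for illustration. It assumes, as stated above, that the loader places the data on an 8-byte boundary and that MOVE16 rounds its source address down to a multiple of 16.

```c
#include <stdint.h>

/* Layout from the post: 8 bytes of zeros, then the `zeros` label with
   16 more bytes of zeros. `start` is the load address of the first
   padding byte, so the label sits at start + 8. */
static int line_is_all_zeros(uint32_t start)
{
    uint32_t label = start + 8;               /* address of `zeros` */
    uint32_t base  = label & ~(uint32_t)15;   /* MOVE16 rounds down */
    /* The 16-byte line MOVE16 reads must lie inside the 24 zero bytes. */
    return base >= start && base + 16 <= start + 24;
}
```

For an 8-byte aligned `start` such as 0x1000 or 0x1008 the function returns 1. For a start that is only 4-byte aligned, such as 0x1004, it returns 0, so 8 bytes of padding would not be enough in that case.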
dalton
Member
#10 - Posted: 30 May 2011 17:29
that's nice!

also, the fact that move16 needs an explicit cache poke in advance could explain some problems I've been having =)
Cosmos Amiga
Member
#11 - Posted: 15 Aug 2019 12:20 - Edited
I have to copy a 320x200 screen from fastmem to fastmem on the 68060, and I cannot use move16!

Do I need to fill the 8 KB data cache with tst.w, then do some move.l (a0)+,(a1)+, and repeat that 8 times?

Or should I include the tst.w in the .loop?

What is the fastest routine to copy these 64 KB of data?

What do you think about this?

move.w #$0FA0,d2 ; 320 * 200 = 64000 bytes / 16 = 4000 = $FA0 lines

.loop_cache
tst.w 0*16+6(a0)
tst.w 1*16+6(a0)
tst.w 2*16+6(a0)
tst.w 3*16+6(a0) ; preload 4 dcache lines
moveq #4,d1 ; copy 4 lines, 1 dcache line per pass

.loop
move.l (a0)+,(a1)+ ; 4 bytes
move.l (a0)+,(a1)+ ; 4 bytes
move.l (a0)+,(a1)+ ; 4 bytes
move.l (a0)+,(a1)+ ; 4 bytes x 4 = 16 bytes = 1 dcache line
subq.b #1,d1
bne.b .loop ; *4
subq.w #4,d2
bne.b .loop_cache ; go preload 4 new dcache lines


Thanks,
Blueberry
Member
#12 - Posted: 15 Aug 2019 16:34 - Edited
Read my post in this thread from 26 May 2011.

Why is it that you can't use move16?
Cosmos Amiga
Member
#13 - Posted: 15 Aug 2019 16:47 - Edited
Because my a0 and a1 come from AllocMem or AllocVec inside P96/CybergraphX: they are not aligned to 16-byte boundaries, which is a big problem in Kickstart 3.9...

I checked: AllocMem always returns a fastram pointer aligned to an 8-byte boundary...

And AllocVec returns one aligned to a 4-byte boundary...

"If source and destination are aligned differently, use MOVEs to copy 16 bytes from the unaligned source into a fixed, aligned cache line, followed by MOVE16 from this fixed line to the aligned destination."

Yes, it's a good idea. I need some code to get them aligned; I'm going to think about that...
todi
Member
#14 - Posted: 15 Aug 2019 19:00 - Edited
Couldn't you just align the memory you are allocating, for example like this:

		...
add.l #16,d0
jsr _LVOAllocMem(a6)
tst.l d0
beq.s .error
move.l d0,memory_used_for_freemem
add.l #16-1,d0
and.l #$fffffff0,d0
move.l d0,memory_used_for_routine
...
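
The rounding in the snippet above, written as C so it can be checked. A sketch only: `align_up16` is a made-up helper name, and the addresses in the note below are arbitrary examples, not real AllocMem results.

```c
#include <stdint.h>

/* Round an address up to the next 16-byte boundary, matching
   add.l #16-1,d0 / and.l #$fffffff0,d0 in the snippet above. */
static uint32_t align_up16(uint32_t addr)
{
    return (addr + 15u) & 0xfffffff0u;
}
```

For example, 0x1008 rounds up to 0x1010 and 0x1000 stays where it is. The result is at most 15 bytes past the original pointer, so asking for 16 extra bytes keeps the whole aligned buffer inside the allocation. The original pointer and the enlarged size are still what must go to FreeMem.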
Cosmos Amiga
Member
#15 - Posted: 15 Aug 2019 19:05
@todi

The memory allocations are into P96 or CybergraphX...
todi
Member
#16 - Posted: 15 Aug 2019 19:17
Ahh, ok. For CybergraphX, just use AllocMem, align it, and then make your own BitMap struct with the aligned memory.

 
