|
Author |
Message |
sp_
Member |
What is the fastest way to clear a buffer on 060? And what is the fastest way to copy data?
move16 ? fmove? movem.l?
tst.w 31(a0) cache prefetching?
Please show me the optimal loops.
|
sp_
Member |
Here is an untested copy loop that uses the FPU. Assume a0 and a1 point to two buffers aligned to 16 bytes.

        fmove.d (a0)+,fp0
        fmove.d fp0,(a1)+       ; align memory by 8
        bra.b   .inn
.loop   fmove.d fp0,(a1)+       ; cached
        fmove.d fp1,(a1)+       ; cached
        fmove.d fp2,(a1)+       ; cached
        fmove.d fp3,(a1)+       ; cached
.inn    tst.w   23(a1)          ; fetch 2 cachelines (write) (((a1+8) mod 32)-1)
        subq.l  #1,d7           ; free
        fmove.d (a0)+,fp0       ; cached
        fmove.d (a0)+,fp1       ; fetch cacheline
        fmove.d (a0)+,fp2       ; cached
        fmove.d (a0)+,fp3       ; fetch cacheline
        bne.b   .loop           ; free
|
sp_
Member |
Here is an untested integer version:
        move.l  (a0)+,(a1)+     ; align (a0 and a1 mod 16 + 4)
        bra.b   .inn
.loop   move.l  d0,(a1)+        ; cached
        move.l  d1,(a1)+        ; cached
        move.l  d2,(a1)+        ; cached
        move.l  a3,(a1)+        ; cached
.inn    tst.w   12(a1)          ; fetch next cacheline
        move.l  (a0)+,d0        ; cached
        move.l  (a0)+,d1        ; cached
        move.l  (a0)+,d2        ; cached
        subq.l  #1,d7           ; free
        move.l  (a0)+,a3        ; fetch cacheline
        bne.b   .loop           ; free
|
jamie2010
Member |
MOVE16 will be faster if you want to copy a big chunk of memory.
|
sp_
Member |
What about memclear? Which loop is the fastest?

.loop   fmove.d fp1,(a1)+
        fmove.d fp2,(a1)+
        fmove.d fp3,(a1)+
        fmove.d fp4,(a1)+
        subq.l  #1,d7
        bne.b   .loop

or

.loop   move16  (a0),(a1)+
        move16  (a0),(a1)+
        subq.l  #1,d7
        bne.b   .loop

or

.loop   tst.w   47(a1)
        fmove.d fp1,(a1)+
        fmove.d fp2,(a1)+
        fmove.d fp3,(a1)+
        fmove.d fp4,(a1)+
        subq.l  #1,d7
        bne.b   .loop

or

.loop   tst.w   47(a1)
        move16  (a0),(a1)+
        move16  (a0),(a1)+
        subq.l  #1,d7
        bne.b   .loop
|
dalton
Member |
I don't think move16 (a0),(a1)+ is a valid instruction... I'd write something like

        move.l  a0,a1
        clr.l   (a1)+
        clr.l   (a1)+
        clr.l   (a1)+
        clr.l   (a1)+
.loop   move16  (a0)+,(a1)+
        subq.l  #1,d0
        bgt.b   .loop

MOVE16 should be faster than fmove.d, as MOVE16 bypasses the cache. There was a thread about it here not long ago.
|
Blueberry
Member |
For details on MOVE16 and other memory access, see this thread. In summary, the key to copy/clear performance is to avoid reading (i.e. only writing) the destination. The only instruction which accomplishes this is MOVE16. Thus:
- If source and destination are aligned identically (relative to 16-byte boundaries), use MOVE16 from source to destination.
- If source and destination are aligned differently, use MOVEs to copy 16 bytes from the unaligned source into a fixed, aligned cache line, followed by MOVE16 from this fixed line to the aligned destination.
- For clearing, clear a fixed, aligned cache line and copy it repeatedly to the destination using MOVE16.
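The choice between the first two cases depends only on whether both pointers have the same offset within a 16-byte line. A minimal C sketch of that test follows; the enum and function names are hypothetical and do not come from the thread.

```c
#include <stdint.h>

/* Hypothetical strategy names, used only for this illustration. */
enum copy_strategy {
    COPY_MOVE16_DIRECT,  /* same offset within a 16-byte line: MOVE16 source to destination */
    COPY_MOVE16_BOUNCE   /* different offsets: stage 16 bytes in a fixed aligned line first */
};

/* Compare the low four address bits of both pointers.
 * If they match, source and destination are aligned identically
 * relative to 16-byte boundaries. */
static enum copy_strategy pick_copy_strategy(uintptr_t src, uintptr_t dst)
{
    return (((src ^ dst) & 15u) == 0) ? COPY_MOVE16_DIRECT
                                      : COPY_MOVE16_BOUNCE;
}
```

Even in the direct case, any bytes before the first 16-byte boundary and after the last one would presumably still need ordinary moves, since MOVE16 only transfers whole lines.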
|
dalton
Member |
So my suggestion was not really good because it will read through the entire buffer. Better to do it like this (although it bothers me that it's not relocatable):

.loop   move16  #zeros,(a0)+
        subq.l  #1,d0
        bgt.b   .loop

        align   0,16
zeros   dc.l    0,0,0,0
|
Blueberry
Member |
Yes, except:
- The addressing mode would be zeros,(a0)+ rather than #zeros,(a0)+. Just a typo, I presume.
- 16-byte aligning a label in an executable does not guarantee 16-byte alignment in memory after loading, since the loader only guarantees 8-byte alignment. This can be fixed by putting some zeros before the zeros label as well, since MOVE16 just rounds the address down.
- The source line must be read into the cache explicitly before the loop. MOVE16 does not put the source line in the cache if it is not already there.

To make the code relocatable, we could have

        lea     zeros(pc),a1
        tst.l   (a1)            ; load line into cache
.loop:  move16  (a1)+,(a0)+
        lea     -16(a1),a1
        subq.l  #1,d0
        bgt.b   .loop

        align   0,8
        dc.l    0,0
zeros:  dc.l    0,0,0,0

It will be just as fast, since the LEA pairs with the SUB (and the last two cycles of MOVE16 from cached memory can overlap with other instructions anyway).
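To see why 8 bytes of zeros in front of the label are enough, the round-down behaviour can be modelled in C. This is only an address-arithmetic sketch: the function names are hypothetical, and it assumes, as stated above, that MOVE16 ignores the low four address bits and that the label ends up 8-byte aligned after loading.

```c
#include <stdint.h>

/* Start of the 16-byte line that would be accessed for addr,
 * assuming the low four address bits are simply ignored. */
static uintptr_t move16_line(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}

/* Number of bytes in front of addr that belong to the accessed line.
 * For an 8-byte aligned addr this is either 0 or 8. */
static unsigned bytes_before_label(uintptr_t addr)
{
    return (unsigned)(addr - move16_line(addr));
}
```

With an 8-byte aligned label, the accessed line starts at most 8 bytes before it and ends no later than 16 bytes after it, so 8 bytes of padding plus the 16 bytes at the label cover every case.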
|
dalton
Member |
that's nice!
also, the fact that move16 needs an explicit cache poke in advance could explain some problems I've been having =)
|
Cosmos Amiga
Member |
I have to copy a 320x200 screen from fast mem to fast mem on the 68060, and I cannot use move16! Do I need to fill the 8 KB data cache with tst.w, do some move.l (a0)+,(a1)+, and then repeat that 8 times? Or should I include the tst.w in the .loop? What is the fastest routine to copy these 64 KB of data? What do you think about this?

        move.w  #$0FA0,d2       ; 320 * 200 = 64000 / 16 = 4000 = $FA0
.loop_cache
        tst.w   0*16+6(a0)
        tst.w   1*16+6(a0)
        tst.w   2*16+6(a0)
        tst.w   3*16+6(a0)      ; preload 4 dcache lines
        moveq   #4,d1           ; go copy 1 dcache lines
.loop   move.l  (a0)+,(a1)+     ; 4 bytes
        move.l  (a0)+,(a1)+     ; 4 bytes
        move.l  (a0)+,(a1)+     ; 4 bytes
        move.l  (a0)+,(a1)+     ; 4 bytes x 4 = 16 bytes = 1 dcache line
        subq.b  #1,d1
        bne.b   .loop           ; *4
        subq.w  #4,d2
        bne.b   .loop_cache     ; go preloading 4 new dcache lines
Thanks,
|
Blueberry
Member |
Read my post in this thread from 26 May 2011.
Why is it that you can't use move16?
|
Cosmos Amiga
Member |
Because my a0 and a1 come from AllocMem or AllocVec inside P96/CybergraphX, so they are not aligned to 16-byte boundaries, which is a big problem in Kickstart 3.9... I checked: AllocMem always returns a fast RAM pointer aligned to an 8-byte boundary, and AllocVec one aligned to a 4-byte boundary...

"If source and destination are aligned differently, use MOVEs to copy 16 bytes from the unaligned source into a fixed, aligned cache line, followed by MOVE16 from this fixed line to the aligned destination."

Yes, it's a good idea. I need some code to get them aligned; I'm going to think about that...
|
todi
Member |
Couldn't you just align the memory you are allocating, like this, for example:

        ...
        add.l   #16,d0
        jsr     _LVOAllocMem(a6)
        tst.l   d0
        beq.s   .error
        move.l  d0,memory_used_for_freemem
        add.l   #16-1,d0
        and.l   #$fffffff0,d0
        move.l  d0,memory_used_for_routine
        ...
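The same over-allocate-and-round-up arithmetic, written out in C as a sketch. The names LINE, padded_size and align_up16 are hypothetical; only the constants 16, 16-1 and the mask come from the snippet above.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE 16u  /* MOVE16 line size in bytes */

/* Size to request from the allocator so that an aligned block of
 * `size` bytes always fits inside (mirrors add.l #16,d0). */
static size_t padded_size(size_t size)
{
    return size + LINE;
}

/* Round addr up to the next 16-byte boundary
 * (mirrors add.l #16-1,d0 / and.l #$fffffff0,d0). */
static uintptr_t align_up16(uintptr_t addr)
{
    return (addr + (LINE - 1)) & ~(uintptr_t)(LINE - 1);
}
```

As in the assembly, the unaligned pointer and the padded size have to be kept around, because those are the values the block must eventually be freed with.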
|
Cosmos Amiga
Member |
@todi
The memory allocations are made inside P96 or CybergraphX...
|
todi
Member |
Ahh, OK. For CybergraphX, just use AllocMem, align it, and then make your own BitMap struct with the aligned memory.
|