A.D.A. Amiga Demoscene Archive

Forum / Coding / Looking for good coders for improving the gfx Quake copy routine

 

Cosmos Amiga
Member
#1 - Posted: 27 Mar 2018 18:10 - Edited
I have this graphics copy routine, fastram to fastram (RTG), in my resourced asm source of the old Quake 1 port from PXL Computers. The goal is to optimize it for maximum speed on the 68060...

First, the ugly 512 movem were reduced to only 4 because of the code cache and the loop buffer...

Next, what do you suggest ?

- move.l instead of movem.l ?
- prefetch the read lines into the data cache ?
- other ideas ?



; (d0.w/d1.w/d2.w/d3.w/d4.w/d5.w/a0/a1) ()

_CopyMemQuake1
movem.l d0-d7/a0-a6,-(sp)

subq.w #1,d4
tst.w d1

beq.b .skip_mulu
move.w d1,d6

mulu.w d2,d6

add.l d6,a0
add.l d6,a1

.skip_mulu
move.w d3,d1
lsr.w #5,d1 ; /32

subq.w #1,d1
eor.w #$007F,d1

lsl.w #2,d1
move.w d1,d0

lsl.w #1,d1
add.w d0,d1

sub.w d3,d5
cmp.w d2,d3

beq.b .quick_movem
add.w d0,a0

add.w d0,a1
sub.w d3,d2

lsr.w #1,d3 ; /2
and.w #$000F,d3

swap d4

move.w d3,d4

swap d4

.jmp_a2
sub.w d1,d3

movem.l (a0)+,d0/d6-d7/a2-a6

movem.l d0/d6-d7/a2-a6,(a1)

lea $20(a1),a1
subq.w #1,d3

bcc.b .loop_w

bra.b .continue_loop_bf

.loop_bf
bfextu d4{0:16},d3

bra.b .jmp_a2

.loop_w
move.w (a0)+,(a1)+
subq.w #1,d3

bcc.b .loop_w

.continue_loop_bf
add.w d2,a0
add.w d5,a1

subq.w #1,d4

bcc.b .loop_bf

movem.l (sp)+,d0-d7/a0-a6

rts

.quick_movem
sub.w d1,d4

.loop_qmovem
movem.l (a0)+,d1/d6-d7/a2-a6 ; 128 movem in total

movem.l d1/d6-d7/a2-a6,(a1) ; 1 movem move 8*4 = 32 bytes

lea $20(a1),a1
add.w d5,a1

subq.w #1,d4

bcc.b .loop_qmovem

movem.l (sp)+,d0-d7/a0-a6

rts

todi
Member
#2 - Posted: 27 Mar 2018 20:51
Take a look at these old threads:
http://ada.untergrund.net/?p=boardthread&id=613
http://ada.untergrund.net/?p=boardthread&id=585
But if you are copying from fast -> chip, you won't get any speed gain from MOVE16.

One other thing is to copy from fast -> chip when screen DMA is off:
https://amycoders.org/opt/fasttruec2p.html

If you want to optimize even further, try interleaving calculations that don't access memory (CPU & FPU) between the copy instructions, and always copy longwords.
Cosmos Amiga
Member
#3 - Posted: 28 Mar 2018 10:43
Thank you Todi !

I made a small mistake: this routine is only used under RTG, so it's fastram to fastram...

There is a version of the same routine using move16 if a 040/060 is detected (you have to select it in the Render Routine menu).

After some investigation, only the .quick_movem part is used, and d5 is always zero...

Updated code:


.quick_movem
sub.w d1,d4

.loop_qmovem
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
subq.w #1,d4

bcc.b .loop_qmovem

movem.l (sp)+,d0-d7/a0-a6

rts


I'm going to run some benchmarks on my 1200 + Mediator...
noname
Member
#4 - Posted: 28 Mar 2018 13:53 - Edited
Consider using (or at least benchmarking against) CopyMem and CopyMemQuick (the latter is a bit more restrictive about alignment). Happy users could then even patch their system, e.g. with NewCMQ060, and benefit from it. See that link also for a good speed comparison table and the source.
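For benchmarking, a call to CopyMemQuick from asm could look roughly like the sketch below. It assumes the caller has already placed the source in a0, the destination in a1 and the byte count in d0, that both pointers are longword aligned, and that the size is a multiple of 4. The vector offset is defined locally here instead of coming from the system includes.

```asm
; Sketch only: copy d0 bytes from a0 to a1 through exec.library.
; Assumes longword-aligned pointers and a size that is a multiple of 4.
_LVOCopyMemQuick	equ	-630		; exec.library vector offset

	move.l	a6,-(sp)		; preserve a6
	move.l	4.w,a6			; ExecBase from absolute address 4
	; a0 = source, a1 = destination, d0.l = byte count (set by the caller)
	jsr	_LVOCopyMemQuick(a6)
	move.l	(sp)+,a6		; restore a6
```

As with any exec.library call, d0/d1/a0/a1 should be treated as scratch after the call, so the pointers have to be reloaded for the next frame.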
hellfire
Member
#5 - Posted: 28 Mar 2018 14:19 - Edited
Cosmos Amiga:
move.l (a0)+,(a1)+

On 060 you get only one cache-access per cycle. So the move takes two cycles, one for the read and one for the write.
If you have something else to do, you could split the read & write into two separate instructions and get two free instruction slots in the secondary execution pipeline.
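A minimal sketch of that split, assuming the same a0/a1 pointers as in the loop above and d0 as a free scratch register (the register choice is an assumption, not taken from the Quake source). The commented slots mark where a register-only instruction could be placed:

```asm
; Sketch only: one longword copied with the read and the write split up.
; d0 as scratch register is an assumption for illustration.
	move.l	(a0)+,d0	; read: one data cache access
	; <slot for an instruction that does not access memory>
	move.l	d0,(a1)+	; write: one data cache access
	; <slot for an instruction that does not access memory>
```

This only pays off if there is useful register-only work to put into the slots. For a pure copy loop with nothing else to do, the number of cache accesses stays the same.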
Cosmos Amiga
Member
#6 - Posted: 5 Oct 2018 09:37 - Edited
I finally tried, and zero speedup...


So I removed all the FPU opcodes missing on the 040/060 (fsin, fcos & fsincos) and replaced the emulated fsqrt with a real fsqrt.x, and I got a small speedup on the 68060.

Download here: http://warpclassic68k.blogspot.com/p/blog-page.html


Yes, there are some emulated FPU routines inside, I don't know exactly why...

fsqrt.x is implemented in hardware on the 68881/2, 68040 and 68060, so why ???...


Does anyone here know the coder Max (Maximo Piva), so we could ask him ?

 
