A.D.A. Amiga Demoscene Archive

Forum / Coding / Looking for good coders for improving the gfx Quake copy routine

 

Cosmos Amiga
Member
#1 - Posted: 27 Mar 2018 18:10 - Edited
I have this graphics copy routine (fastram to fastram, RTG) in my resourced asm source of the old Quake1 port from PXL Computers. The goal is to optimize it for maximum speed on the 68060...

First, the ugly 512 movem were reduced to only 4 because of the code cache and the loop buffer...

Next, what do you suggest?

- move.l instead of movem.l?
- prefetching the source lines into the data cache?
- other ideas?



; (d0.w/d1.w/d2.w/d3.w/d4.w/d5.w/a0/a1) ()

_CopyMemQuake1
movem.l d0-d7/a0-a6,-(sp)

subq.w #1,d4
tst.w d1

beq.b .skip_mulu
move.w d1,d6

mulu.w d2,d6

add.l d6,a0
add.l d6,a1

.skip_mulu
move.w d3,d1
lsr.w #5,d1 ; /32

subq.w #1,d1
eor.w #$007F,d1

lsl.w #2,d1
move.w d1,d0

lsl.w #1,d1
add.w d0,d1

sub.w d3,d5
cmp.w d2,d3

beq.b .quick_movem
add.w d0,a0

add.w d0,a1
sub.w d3,d2

lsr.w #1,d3 ; /2
and.w #$000F,d3

swap d4

move.w d3,d4

swap d4

.jmp_a2
sub.w d1,d3

movem.l (a0)+,d0/d6-d7/a2-a6

movem.l d0/d6-d7/a2-a6,(a1)

lea $20(a1),a1
subq.w #1,d3

bcc.b .loop_w

bra.b .continue_loop_bf

.loop_bf
bfextu d4{0:16},d3

bra.b .jmp_a2

.loop_w
move.w (a0)+,(a1)+
subq.w #1,d3

bcc.b .loop_w

.continue_loop_bf
add.w d2,a0
add.w d5,a1

subq.w #1,d4

bcc.b .loop_bf

movem.l (sp)+,d0-d7/a0-a6

rts

.quick_movem
sub.w d1,d4

.loop_qmovem
movem.l (a0)+,d1/d6-d7/a2-a6 ; 128 movem in total

movem.l d1/d6-d7/a2-a6,(a1) ; each movem moves 8*4 = 32 bytes

lea $20(a1),a1
add.w d5,a1

subq.w #1,d4

bcc.b .loop_qmovem

movem.l (sp)+,d0-d7/a0-a6

rts

todi
Member
#2 - Posted: 27 Mar 2018 20:51
Take a look at these old threads:
http://ada.untergrund.net/?p=boardthread&id=613
http://ada.untergrund.net/?p=boardthread&id=585
But if you are copying from fast -> chip, you won't get any speed gain from MOVE16.

Another option is to copy from fast -> chip while screen DMA is off:
https://amycoders.org/opt/fasttruec2p.html

If you want to optimize even further, try interleaving calculations that don't access memory (CPU & FPU) between the copy instructions, and always copy longwords.
Cosmos Amiga
Member
#3 - Posted: 28 Mar 2018 10:43
Thank you Todi!

I made a small mistake: this routine is only used under RTG, so it's fastram to fastram...

There is an equivalent routine using move16 for when an 040/060 is detected (you have to select it in the Render Routine menu).

After some investigation, only the .quick_movem part is used, and d5 is always zero...

Updated code:


.quick_movem
sub.w d1,d4

.loop_qmovem
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
subq.w #1,d4

bcc.b .loop_qmovem

movem.l (sp)+,d0-d7/a0-a6

rts


I'm going to run some benchmarks on my 1200 + Mediator...
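
Why the general path can be dropped when only .quick_movem is taken can be sketched in C. This is only an illustration under assumptions: the register meanings are guessed from the listing (d3 as row width in bytes, d2 as source pitch, d5 as destination modulo once the width has been subtracted), and copy_rect with its parameter names is hypothetical, not part of the PXL source. When the width equals both pitches the rows sit back to back, so the whole rectangle becomes one linear copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy a width x height byte rectangle.
   src_pitch / dst_pitch are the byte distances between row starts
   (assumed to correspond to d2 and to d3 + d5 in the asm listing). */
static void copy_rect(uint8_t *dst, const uint8_t *src,
                      size_t width, size_t height,
                      size_t src_pitch, size_t dst_pitch)
{
    /* Quick path: no gap between rows on either side, so the
       rectangle is one contiguous block. */
    if (width == src_pitch && width == dst_pitch) {
        memcpy(dst, src, width * height);
        return;
    }

    /* General path: copy row by row and skip the gap after each row. */
    for (size_t y = 0; y < height; y++) {
        memcpy(dst, src, width);
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

If that reading of the registers is right, the quick path needs no per-row pointer fix-up at all, which matches the updated loop above no longer adding d5 to a1.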
noname
Member
#4 - Posted: 28 Mar 2018 13:53 - Edited
Consider using (or at least benchmarking against) CopyMem and CopyMemQuick (the latter is a bit more restrictive about alignment). Happy users could then even patch their system, e.g. with NewCMQ060, and benefit from it. The NewCMQ060 page also has a good speed comparison table and the source.
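
A minimal sketch of such a comparison in portable C, under stated assumptions: memcpy stands in for the system copy routine (on AmigaOS that would be exec's CopyMem or CopyMemQuick), and copy_unrolled8, copy_memcpy and time_copy are hypothetical names made up for this example. copy_unrolled8 mirrors the eight unrolled move.l (a0)+,(a1)+ from the updated loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Common signature so both candidates can be timed the same way.
   `blocks` counts chunks of 8 longwords (32 bytes). */
typedef void (*copy_fn)(uint32_t *dst, const uint32_t *src, size_t blocks);

/* Counterpart of the eight unrolled move.l (a0)+,(a1)+. */
static void copy_unrolled8(uint32_t *dst, const uint32_t *src, size_t blocks)
{
    while (blocks--) {
        dst[0] = src[0]; dst[1] = src[1];
        dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5];
        dst[6] = src[6]; dst[7] = src[7];
        dst += 8;
        src += 8;
    }
}

/* Reference candidate: the library copy (stand-in for CopyMemQuick). */
static void copy_memcpy(uint32_t *dst, const uint32_t *src, size_t blocks)
{
    memcpy(dst, src, blocks * 8 * sizeof(uint32_t));
}

/* Returns the CPU time in seconds spent on `repeats` copies. */
static double time_copy(copy_fn fn, uint32_t *dst, const uint32_t *src,
                        size_t blocks, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn(dst, src, blocks);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy once with copy_unrolled8 and once with copy_memcpy on a frame-sized buffer gives two numbers to compare. An optimizing compiler may remove repeated copies whose result is never read, so the destination should be checked or used after timing.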
hellfire
Member
#5 - Posted: 28 Mar 2018 14:19 - Edited
Cosmos Amiga:
move.l (a0)+,(a1)+

On the 060 you get only one cache access per cycle, so the move takes two cycles: one for the read and one for the write.
If you have something else to do, you could split the read and the write into two separate instructions and get two free instruction slots in the secondary execution pipeline.
Cosmos Amiga
Member
#6 - Posted: 5 Oct 2018 09:37 - Edited
I finally tried it: zero speedup...


So I removed all the FPU opcodes that are missing on the 040/060 (fsin, fcos & fsincos) and replaced the emulated fsqrt with a real fsqrt.x, and I get a small speedup on the 68060.

Download here: http://warpclassic68k.blogspot.com/p/blog-page.html


Yes, there are some emulated FPU routines inside; I don't know exactly why...

fsqrt.x is implemented in hardware on the 68881/2, 68040 and 68060, so why emulate it?


Does anyone here know the coder Max (Maximo Piva), so we could ask him?
hellfire
Member
#7 - Posted: 13 Nov 2018 11:46
Cosmos Amiga:
I replaced the emulated fsqrt with a real fsqrt.x, and I get a small speedup on the 68060.

I'm not familiar with the 060 port, but the Quake1 source was famous for its 1/sqrt trick, which is described here:
https://en.wikipedia.org/wiki/Fast_inverse_square_root
If that's ported properly to big-endian, it should be faster than the FPU (unless you can schedule the FPU instruction early enough and run it in parallel with some integer-only code).
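
For reference, a minimal C sketch of that trick which does not care about byte order, as long as float and 32-bit integers share the same endianness (assumed here, and true on 68k). The constant 0x5F3759DF and the single Newton step are the ones from the linked article; the function name fast_rsqrt is made up for this example, and this is not the code from the PXL port.

```c
#include <stdint.h>
#include <string.h>

/* Approximates 1/sqrt(x) for positive, finite, normalized x. */
static float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;
    const float half_x = 0.5f * x;

    /* Reinterpret the float as an integer. memcpy copies all four
       bytes in storage order, so no byte swapping is needed. */
    memcpy(&bits, &x, sizeof bits);

    /* Initial guess from the exponent/mantissa bit pattern. */
    bits = 0x5F3759DFu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);

    /* One Newton-Raphson iteration to refine the guess. */
    y = y * (1.5f - half_x * y * y);
    return y;
}
```

The result is only accurate to a fraction of a percent, so it suits things like vector normalization rather than anything that needs full single precision.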
Cosmos Amiga
Member
#8 - Posted: 16 Nov 2018 15:53 - Edited
In Quake1 or Quake3?

In this PXL port it really is an fsqrt... And many fsin & fcos calls are handled the same way, so why not use fsin & fcos directly?

Some other things in this version are weird too, and cost a lot of cycles...

Where is RedSkull? Maybe he knows Max?
Cosmos Amiga
Member
#9 - Posted: 17 Nov 2018 09:45 - Edited
The source contains:

- 30x fsincos
- 10x fsin
- 4x fabs
- 16x fsqrt
- 0x fcos

And :

- fsin emulation subroutine : called 15x
- fabs emulation subroutine : called 17x
- fsqrt emulation subroutine : called 8x
- fcos emulation subroutine : called 13x


Of course, the software-emulated subroutines are much slower than the real hardware instructions...

Strange, isn't it ?!
todi
Member
#10 - Posted: 17 Nov 2018 17:36
FSINCOS, FSIN & FCOS are unimplemented instructions on the 68060; they are emulated by the 68060.library. FSQRT is implemented in hardware on both the 68040 and the 68060. (On the 68040, FSINCOS, FSIN & FCOS are likewise unimplemented and are emulated by the 68040.library.)

But maybe the speed gain is because you have something like OxyPatcher or CyberPatcher on your system, which use their own implementations of the emulated instructions?

Cosmos Amiga
Member
#11 - Posted: 18 Nov 2018 13:10 - Edited
Yes, but why are 30 + 10 + 4 = 44 unimplemented instructions still in there?

And why the fsqrt and fabs subroutines?

It makes no sense to me...

 
