A.D.A. Amiga Demoscene Archive

Amiga Demoscene Archive Forum / Coding / Looking for good coders for improving the gfx Quake copy routine

Cosmos Amiga Member	#1 - Posted: 27 Mar 2018 18:10 - Edited
I have this graphics copy routine, fastram to fastram (RTG), in my resourced asm source of the old Quake1 from PXL Computers. The goal is to optimize it for maximum speed on the 68060...

First, the ugly 512 movem were reduced to only 4 because of the code cache and the loop buffer...

Next, what do you suggest?
- move.l instead of movem.l?
- prefetching the read lines into the data cache?
- other ideas?

; (d0.w/d1.w/d2.w/d3.w/d4.w/d5.w/a0/a1) ()
_CopyMemQuake1
	movem.l	d0-d7/a0-a6,-(sp)
	subq.w	#1,d4
	tst.w	d1
	beq.b	.skip_mulu
	move.w	d1,d6
	mulu.w	d2,d6
	add.l	d6,a0
	add.l	d6,a1
.skip_mulu
	move.w	d3,d1
	lsr.w	#5,d1			; /32
	subq.w	#1,d1
	eor.w	#$007F,d1
	lsl.w	#2,d1
	move.w	d1,d0
	lsl.w	#1,d1
	add.w	d0,d1
	sub.w	d3,d5
	cmp.w	d2,d3
	beq.b	.quick_movem
	add.w	d0,a0
	add.w	d0,a1
	sub.w	d3,d2
	lsr.w	#1,d3			; /2
	and.w	#$000F,d3
	swap	d4
	move.w	d3,d4
	swap	d4
.jmp_a2
	sub.w	d1,d3
	movem.l	(a0)+,d0/d6-d7/a2-a6
	movem.l	d0/d6-d7/a2-a6,(a1)
	lea	$20(a1),a1
	subq.w	#1,d3
	bcc.b	.loop_w
	bra.b	.continue_loop_bf
.loop_bf
	bfextu	d4{0:16},d3
	bra.b	.jmp_a2
.loop_w
	move.w	(a0)+,(a1)+
	subq.w	#1,d3
	bcc.b	.loop_w
.continue_loop_bf
	add.w	d2,a0
	add.w	d5,a1
	subq.w	#1,d4
	bcc.b	.loop_bf
	movem.l	(sp)+,d0-d7/a0-a6
	rts
.quick_movem
	sub.w	d1,d4
.loop_qmovem
	movem.l	(a0)+,d1/d6-d7/a2-a6	; 128 movem in total
	movem.l	d1/d6-d7/a2-a6,(a1)	; one movem moves 8*4 = 32 bytes
	lea	$20(a1),a1
	add.w	d5,a1
	subq.w	#1,d4
	bcc.b	.loop_qmovem
	movem.l	(sp)+,d0-d7/a0-a6
	rts
todi Member	#2 - Posted: 27 Mar 2018 20:51
Take a look at these old threads:
http://ada.untergrund.net/?p=boardthread&id=613
http://ada.untergrund.net/?p=boardthread&id=585

But if you are copying from fast -> chip, you won't get any speed gain from MOVE16. Another option is to copy from fast -> chip when screen DMA is off:
https://amycoders.org/opt/fasttruec2p.html

If you want to optimize even further, try weaving non-memory-accessing calculations (CPU & FPU) in between the copy instructions, and always copy longwords.
Cosmos Amiga Member	#3 - Posted: 28 Mar 2018 10:43
Thank you Todi!

I made a small mistake: this routine is only used under RTG, so it's fastram to fastram... There is also a version of the same routine using move16 when a 040/060 is detected (you have to select it in the Render Routine menu).

After some investigation, it turns out only the .quick_movem part is used, and d5 is always zero...

Updated code:

.quick_movem
	sub.w	d1,d4
.loop_qmovem
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	subq.w	#1,d4
	bcc.b	.loop_qmovem
	movem.l	(sp)+,d0-d7/a0-a6
	rts

I'm going to run some benchmarks on my 1200 + Mediator...
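For readers who want to check the asm against a reference, the unrolled loop above can be modelled in portable C. This is only a sketch under assumptions: the function name `copy_blocks32` and its `blocks` parameter are hypothetical, and the way the real routine derives its loop count from d1/d4 is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the unrolled loop: each pass moves eight
 * longwords (8 * 4 = 32 bytes), like the eight move.l (a0)+,(a1)+ lines.
 * 'blocks' is the number of 32-byte blocks to copy; it is an assumption
 * of this sketch, not a value taken from the original routine. */
void copy_blocks32(uint32_t *dst, const uint32_t *src, size_t blocks)
{
    assert(dst != NULL && src != NULL);
    while (blocks-- > 0) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        dst += 8;   /* advance both pointers by one 32-byte block */
        src += 8;
    }
}
```

Comparing the output of a model like this with the output of the asm routine on the same buffers is a cheap way to make sure an optimization has not changed what gets copied.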
noname Member	#4 - Posted: 28 Mar 2018 13:53 - Edited
Consider using (or at least benchmarking against) CopyMem and CopyMemQuick (the latter is a bit more restrictive about alignment). Happy users could then even patch their system, e.g. with NewCMQ060, and benefit from it. Also check that link for a good speed comparison table and the source.
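A benchmark of that kind could be structured roughly as follows. This is a portable sketch, not Amiga code: `time_copy` and `copy_fn` are hypothetical names, `memcpy` stands in for the system copy routine as the baseline, and `clock()` is used only because it is standard C (its resolution may be too coarse for short runs, so use a large repeat count).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* memcpy, used as the baseline in the usage note */
#include <time.h>

/* Any copy routine with a memcpy-like signature can be timed. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t bytes);

/* Runs 'fn' 'reps' times over the same buffers and returns the elapsed
 * processor time in seconds as reported by clock(). */
double time_copy(copy_fn fn, void *dst, const void *src, size_t bytes, int reps)
{
    clock_t start, end;
    int i;

    assert(fn != NULL && reps > 0);
    start = clock();
    for (i = 0; i < reps; i++)
        fn(dst, src, bytes);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as `time_copy(memcpy, dst, src, size, 1000)` gives the baseline; wrapping a hand-written routine in a function with the same signature and timing it over the same buffers gives the number to compare against.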
hellfire Member	#5 - Posted: 28 Mar 2018 14:19 - Edited
Cosmos Amiga:
	move.l	(a0)+,(a1)+

On the 060 you only get one cache access per cycle, so the move takes two cycles: one for the read and one for the write. If you have something else to do, you could split the read and the write into two separate instructions and get two free instruction slots in the secondary execution pipeline.
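The shape of that idea, with reads and writes separated and register-only work placed between them, looks roughly like this in C. It is an illustration only: `copy_and_sum` is a hypothetical name, the running sum merely stands in for "something else to do", and a C compiler is free to reorder these statements, so on a real 060 the scheduling would have to be done by hand in asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies 'count' longwords (count must be even) and returns the sum of
 * the copied values. The additions use only values already held in
 * locals, so they add no extra memory accesses to the loop. */
uint32_t copy_and_sum(uint32_t *dst, const uint32_t *src, size_t count)
{
    uint32_t sum = 0;
    size_t i;

    assert(count % 2 == 0);
    for (i = 0; i < count; i += 2) {
        uint32_t a = src[i];        /* read 1 */
        uint32_t b = src[i + 1];    /* read 2 */
        sum += a;                   /* register-only work */
        dst[i] = a;                 /* write 1 */
        sum += b;                   /* register-only work */
        dst[i + 1] = b;             /* write 2 */
    }
    return sum;
}
```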
Cosmos Amiga Member	#6 - Posted: 5 Oct 2018 09:37 - Edited
I finally tried it, and got zero speedup...

So, I removed all the missing 040/060 fpu opcodes (fsin, fcos & fsincos) and replaced the emulated fsqrt with a real fsqrt.x, and I get a small speedup on the 68060.

Download here: http://warpclassic68k.blogspot.com/p/blog-page.html

Yes, there are some emulated fpu routines inside, I don't know exactly why... fsqrt.x is present in the 68881/2, 68040 and 68060, so what???...

Does anyone here know the coder Max (Maximo Piva), so we could ask him?
hellfire Member	#7 - Posted: 13 Nov 2018 11:46
Cosmos Amiga:
	I replaced the emulated fsqrt by a real fsqrt.x, and I get a little speedup for the 68060.

I'm not familiar with the 060 port, but the Quake1 source was famous for its 1/sqrt trick, which is described here:
https://en.wikipedia.org/wiki/Fast_inverse_square_root

If that's ported properly to big-endian, it should be faster than the FPU (unless you can schedule the FPU instruction early enough and run it in parallel with some integer-only code).
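For reference, the trick described in the linked article can be written in C without depending on byte order, assuming `float` is a 32-bit IEEE-754 single and that floats and integers share the same endianness (as they do on 68k). This is a sketch of the well-known technique, not code from the PXL port; `fast_rsqrt` is a hypothetical name, and the result is only an approximation of 1/sqrt(x).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximates 1/sqrt(x) for x > 0 using the integer "magic constant"
 * estimate followed by one Newton-Raphson refinement step. */
float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);      /* reinterpret the float's bits */
    bits = 0x5f3759dfu - (bits >> 1);    /* initial estimate */
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson step */
    return y;
}
```

Copying the whole 32-bit word with `memcpy` avoids both pointer-aliasing problems and any need for explicit byte swapping; whether this actually beats a hardware fsqrt plus a divide on a given 68k FPU would have to be measured.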
Cosmos Amiga Member	#8 - Posted: 16 Nov 2018 15:53 - Edited
In Quake1 or Quake3?

In this PXL port, it's really an fsqrt... And many fsin & fcos are handled the same way, so why not use fsin & fcos directly?

Some other things in this version are weird and cost a lot of cycles...

Where is RedSkull, maybe he knows Max?
Cosmos Amiga Member	#9 - Posted: 17 Nov 2018 09:45 - Edited
In the source there are:
- 30x fsincos
- 10x fsin
- 4x fabs
- 16x fsqrt
- 0x fcos

And:
- fsin emulation subroutine: called 15x
- fabs emulation subroutine: called 17x
- fsqrt emulation subroutine: called 8x
- fcos emulation subroutine: called 13x

Of course, the software-emulated subroutines are much slower than the real hardware instructions...

Strange, isn't it?!
todi Member	#10 - Posted: 17 Nov 2018 17:36
FSINCOS, FSIN, FCOS & FSQRT are unimplemented instructions on the 68060; they are emulated by the 68060.library. (On the 68040, FSQRT does exist, but FSINCOS, FSIN and FCOS are unimplemented instructions and are emulated by the 68040.library.)

But maybe the speed gain is because you have something like OxyPatcher or CyberPatcher on your system, which use different emulated implementations of FSINCOS, FSIN, FCOS & FSQRT?
Cosmos Amiga Member	#11 - Posted: 18 Nov 2018 13:10 - Edited
Yes, but why are 30 + 10 + 4 = 44 unimplemented instructions still in there? And why the fsqrt and fabs subroutines? It makes no sense to me...
