|
Author |
Message |
Cosmos Amiga
Member |
I have this graphics copy routine, fastram to fastram (RTG), in my resourced asm source of the old Quake1 port from PXL Computers. The goal is to optimize it for maximum speed on the 68060...
First, the ugly 512 movem were reduced to only 4 because of the code cache and the loop buffer...
Next, what do you suggest?
- move.l instead of movem.l?
- prefetching data cache lines with reads?
- other ideas?
; (d0.w/d1.w/d2.w/d3.w/d4.w/d5.w/a0/a1) ()
_CopyMemQuake1
        movem.l d0-d7/a0-a6,-(sp)
        subq.w  #1,d4
        tst.w   d1
        beq.b   .skip_mulu
        move.w  d1,d6
        mulu.w  d2,d6
        add.l   d6,a0
        add.l   d6,a1
.skip_mulu
        move.w  d3,d1
        lsr.w   #5,d1                   ; /32
        subq.w  #1,d1
        eor.w   #$007F,d1
        lsl.w   #2,d1
        move.w  d1,d0
        lsl.w   #1,d1
        add.w   d0,d1
        sub.w   d3,d5
        cmp.w   d2,d3
        beq.b   .quick_movem
        add.w   d0,a0
        add.w   d0,a1
        sub.w   d3,d2
        lsr.w   #1,d3                   ; /2
        and.w   #$000F,d3
        swap    d4
        move.w  d3,d4
        swap    d4
.jmp_a2
        sub.w   d1,d3
        movem.l (a0)+,d0/d6-d7/a2-a6
        movem.l d0/d6-d7/a2-a6,(a1)
        lea     $20(a1),a1
        subq.w  #1,d3
        bcc.b   .loop_w
        bra.b   .continue_loop_bf
.loop_bf
        bfextu  d4{0:16},d3
        bra.b   .jmp_a2
.loop_w
        move.w  (a0)+,(a1)+
        subq.w  #1,d3
        bcc.b   .loop_w
.continue_loop_bf
        add.w   d2,a0
        add.w   d5,a1
        subq.w  #1,d4
        bcc.b   .loop_bf
        movem.l (sp)+,d0-d7/a0-a6
        rts
.quick_movem
        sub.w   d1,d4
.loop_qmovem
        movem.l (a0)+,d1/d6-d7/a2-a6    ; 128 movem in total
        movem.l d1/d6-d7/a2-a6,(a1)     ; 1 movem move 8*4 = 32 bytes
        lea     $20(a1),a1
        add.w   d5,a1
        subq.w  #1,d4
        bcc.b   .loop_qmovem
        movem.l (sp)+,d0-d7/a0-a6
        rts
|
todi
Member |
Take a look at these old threads:
http://ada.untergrund.net/?p=boardthread&id=613
http://ada.untergrund.net/?p=boardthread&id=585
But if you are copying from fast -> chip, you won't get any speed gain from MOVE16. Another option is to copy from fast -> chip while screen DMA is off:
https://amycoders.org/opt/fasttruec2p.html
If you want to optimize even further, try weaving calculations that don't access memory (CPU & FPU) in between the copy instructions and always copy longword
|
Cosmos Amiga
Member |
Thank you Todi!
I made a little mistake: this routine is used only under RTG, so it's fastram to fastram...
There is the same routine using move16 if a 040/060 is detected (you have to choose that in the Render Routine menu).
After some investigation, only the .quick_movem part is used, and d5 is always zero...
Updated code:
.quick_movem
        sub.w   d1,d4
.loop_qmovem
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        subq.w  #1,d4
        bcc.b   .loop_qmovem
        movem.l (sp)+,d0-d7/a0-a6
        rts
I'm going to run some benchmarks on my 1200 + Mediator...
|
noname
Member |
Consider using (or at least benchmarking against) CopyMem and CopyMemQuick (the latter is a bit more restrictive about alignment). Happy users could then even patch their system, e.g. with NewCMQ060, and benefit from it. Check that link also for a good speed comparison table and the source.
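For a benchmark baseline, a call to exec.library's CopyMemQuick could look like the sketch below. SrcBuffer, DstBuffer and COPY_SIZE are hypothetical names, not symbols from the Quake source. CopyMemQuick expects longword-aligned pointers and a byte count that is a multiple of 4; CopyMem takes the same registers without those restrictions.

```asm
; Sketch only: SrcBuffer, DstBuffer and COPY_SIZE are hypothetical.
_LVOCopyMemQuick equ -630               ; normally supplied by the exec includes

        move.l  4.w,a6                  ; a6 = ExecBase
        move.l  SrcBuffer,a0            ; a0 = source, longword aligned
        move.l  DstBuffer,a1            ; a1 = destination, longword aligned
        move.l  #COPY_SIZE,d0           ; d0 = size in bytes, multiple of 4
        jsr     _LVOCopyMemQuick(a6)    ; d0/d1/a0/a1 are scratch after the call
```

Timing this against the hand-written loop on the same buffers would show whether the custom routine is worth keeping.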
|
hellfire
Member |
Cosmos Amiga: move.l (a0)+,(a1)+
On the 060 you get only one cache access per cycle. So the move takes two cycles, one for the read and one for the write. If you have something else to do, you could split the read & write into two separate instructions and get two free instruction slots in the secondary execution pipeline.
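A minimal sketch of that split, assuming there is independent register-only work available to fill the free slots. The two addq instructions on d6 and d7 are placeholders for such work, not part of the Quake routine; with nothing useful to place there, the split by itself is not expected to be faster than the plain move.l (a0)+,(a1)+.

```asm
; Sketch only: the addq lines stand in for hypothetical independent work.
.loop_split
        move.l  (a0)+,d0                ; cache read
        addq.l  #1,d6                   ; placeholder work, no memory access
        move.l  d0,(a1)+                ; cache write
        addq.l  #1,d7                   ; placeholder work, no memory access
        subq.w  #1,d4
        bcc.b   .loop_split
```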
|
Cosmos Amiga
Member |
I finally tried, and zero speedup...
So I removed all the FPU opcodes that are missing on the 040/060 (fsin, fcos & fsincos) and replaced the emulated fsqrt with a real fsqrt.x, and I get a small speedup on the 68060.
Download here: http://warpclassic68k.blogspot.com/p/blog-page.html
Yes, there are some emulated FPU routines inside, I don't know exactly why...
fsqrt.x is implemented in the 68881/2, 68040 and 68060, so why?...
Does anyone here know the coder Max (Maximo Piva), so we can ask him?
|
hellfire
Member |
Cosmos Amiga: I replaced the emulated fsqrt by a real fsqrt.x, and I get a little speedup for the 68060.
I'm not familiar with the 060 port, but the Quake1 source was famous for its 1/sqrt trick, which is described here: https://en.wikipedia.org/wiki/Fast_inverse_square_root
If that's ported properly to big-endianness, it should be faster than the FPU (unless you can schedule the FPU instruction early enough and run it in parallel with some integer-only code).
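For illustration, the trick from the linked article could be sketched in 68k assembly as below. FastRSqrt is a hypothetical name and this is not code from the PXL port. It assumes the input is a positive IEEE-754 single passed as its bit pattern in d0, and that the assembler accepts floating-point immediates written as #0.5 and #1.5 (this syntax varies between assemblers). Only fmove, fmul, fneg and fadd are used, so nothing here needs FPU emulation.

```asm
; Sketch only: FastRSqrt is a hypothetical routine.
; In:  d0  = bit pattern of a positive IEEE-754 single x
; Out: fp0 = approximately 1/sqrt(x), d0 and fp1 are trashed
FastRSqrt
        fmove.s d0,fp1                  ; fp1 = x
        lsr.l   #1,d0                   ; i >> 1
        neg.l   d0
        add.l   #$5F3759DF,d0           ; i = $5F3759DF - (i >> 1)
        fmove.s d0,fp0                  ; fp0 = y, the first guess
        fmul.s  #0.5,fp1                ; fp1 = x/2
        fmul.x  fp0,fp1                 ; fp1 = (x/2)*y
        fmul.x  fp0,fp1                 ; fp1 = (x/2)*y*y
        fneg.x  fp1                     ; fp1 = -(x/2)*y*y
        fadd.s  #1.5,fp1                ; fp1 = 1.5 - (x/2)*y*y
        fmul.x  fp1,fp0                 ; y = y*(1.5 - (x/2)*y*y), one Newton step
        rts
```

The result is only an approximation, so it suits normalisation-style uses rather than places that need a full-precision square root.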
|
Cosmos Amiga
Member |
In Quake1 or Quake3?
In this PXL port, it's really an fsqrt... And many fsin & fcos are handled the same way, so why not use fsin & fcos directly?
Some other things in this version are weird and cost a lot of cycles...
Where is RedSkull, maybe he knows Max?
|
Cosmos Amiga
Member |
In the source there are:
- 30x fsincos
- 10x fsin
- 4x fabs
- 16x fsqrt
- 0x fcos
And:
- fsin emulation subroutine: called 15x
- fabs emulation subroutine: called 17x
- fsqrt emulation subroutine: called 8x
- fcos emulation subroutine: called 13x
Of course, the software-emulated subroutines are much slower than the real hardware instructions...
Strange, isn't it?!
|
todi
Member |
FSINCOS, FSIN & FCOS are unimplemented instructions on the 68060; they are emulated by the 68060.library. FSQRT exists in hardware on both the 68040 and the 68060. (On the 68040, FSINCOS, FSIN & FCOS are likewise unimplemented and are emulated by the 68040.library.)
But maybe the speed gain is because you have something like OxyPatcher or CyberPatcher on your system, which use different emulated implementations of FSINCOS, FSIN & FCOS?
|
Cosmos Amiga
Member |
Yes, but why are 30 + 10 + 4 = 44 unimplemented instructions still there?
And why the fsqrt and fabs subroutines?
It makes no sense to me...
|
|
|