A.D.A. Amiga Demoscene Archive

Amiga Demoscene Archive Forum / Coding / 060 - fpu optimizing

 

Author Message
sp_
Member
#1 - Posted: 12 Aug 2006 09:54 - Edited
I was thinking about merging some fpu code into the c2p to save some cycles. In theory the fpu instructions should run simultaneously with the chipmem writes and the c2p conversion. Am I right here?
I also need a cycle diagram of all 060 fpu instructions. Has anybody got it online?
Kalms
Member
#2 - Posted: 12 Aug 2006 11:56 - Edited
Yep, bus access does not interfere with the CPU's/FPU's internal operation. Be careful with what memory you're touching when feeding the FPU with data though.

------

All instructions are initially parsed by the CPU. If the FPU operation only works on FPU registers, the CPU will wait until the previous FPU operation has completed, then dispatch the current instruction to the FPU, then the CPU continues processing instructions.

FPU instructions on FP registers are classified as pOEP-but-allows-sOEP.
The FPU itself is not pipelined internally.
Check the 68060UM chapter 10 for timings (it is available as PDF from Freescale).

So, for instance, fadd/fsub/fmul.x fp0,fp1 takes 3 cycles, according to the manual. Therefore a code sequence like this runs at 2 instructions per cycle:

fadd fp0,fp1 ; pOEP ; FPU op cycle 1
move.l d2,d3 ; sOEP
move.l d0,d1 ; pOEP ; FPU op cycle 2
move.l d2,d3 ; sOEP
move.l d0,d1 ; pOEP ; FPU op cycle 3
move.l d2,d3 ; sOEP

fadd fp2,fp3 ; pOEP ; FPU op cycle 1
... etc

However, FPU operations that take a floating-point operand from memory or from an integer register (like fadd.s (a0)+,fp0) may go a bit slower. Float<->int conversions (like fmove.w d0,fp0) take approx. 3 extra cycles (this is specified in 68060UM chapter 10) and I think that the CPU stalls for those 3 cycles (not sure).

Also:
The FPU and the CPU seem to not share any resources at all. So you can do FDIV and DIVS in parallel, as well as FMUL at the same time as MULU/MULS.

That's all I know off the top of my head. I suggest that you set up a timing harness and conduct some experiments of your own...
sp_
Member
#3 - Posted: 25 Mar 2007 13:13
How about conditional branching using the fpu, like fbgt.w? Is branch prediction included here, as with a normal bgt.w?
Can it run during chipmem writes?
Blueberry
Member
#4 - Posted: 26 Mar 2007 14:05
FPU branches use the branch cache, yes. According to the infamous chapter 10 of the 68060 user manual, fbcc takes 2 cycles on a correct prediction and 8 cycles on an incorrect prediction.
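
To get a feel for what those two numbers mean on average, here is a small back-of-the-envelope model in Python (not 060 code). It only uses the 2-cycle and 8-cycle figures quoted above; the function name and the misprediction rates are made up for illustration, and real code will also be affected by whatever surrounds the branch.

```python
def expected_fbcc_cycles(mispredict_rate, hit_cycles=2, miss_cycles=8):
    """Average cycles per fbcc for a given misprediction rate (0.0 to 1.0).

    hit_cycles and miss_cycles default to the figures quoted in the thread;
    mispredict_rate is a hypothetical input, not a measured value.
    """
    if not 0.0 <= mispredict_rate <= 1.0:
        raise ValueError("mispredict_rate must be between 0 and 1")
    return (1.0 - mispredict_rate) * hit_cycles + mispredict_rate * miss_cycles


if __name__ == "__main__":
    # Hypothetical misprediction rates, just to show the trend.
    for rate in (0.0, 0.1, 0.5):
        print(rate, expected_fbcc_cycles(rate))
```

Each misprediction costs 6 cycles on top of the 2-cycle base, so the average grows linearly with the misprediction rate.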

With the chipmem set to imprecise mode (SpeedyChip or non-ancient SetPatch), chip writes go to the write buffer, which can hold four longwords at a time, and from there to chipmem when the chipmem is ready for them. The 68060 can do anything in the meantime, as long as it does not run into a cache miss, in which case it will wait for the write buffer to empty before fetching the needed data into the cache.
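
A rough way to see how much work fits "in the meantime" is to simulate the write buffer. The sketch below is a toy model in Python, not a description of the real bus logic: the buffer depth of four longwords comes from the paragraph above, while `drain_cycles` (CPU cycles chipmem needs per longword) and `work_cycles` (cache-resident work between two writes) are hypothetical parameters, since no chipmem timings are given here. It also assumes no cache misses ever occur.

```python
from collections import deque


def simulate_chip_writes(n_writes, work_cycles, drain_cycles, buffer_depth=4):
    """Toy model of a loop doing `work_cycles` of work, then one chip write.

    Returns (cpu_cycles_until_last_write_is_queued, cycles_spent_stalled).
    drain_cycles and work_cycles are hypothetical; only buffer_depth=4
    is taken from the thread.
    """
    if drain_cycles <= 0:
        raise ValueError("drain_cycles must be positive")
    pending = deque()   # completion times of writes still sitting in the buffer
    now = 0             # current CPU cycle
    bus_free_at = 0     # cycle at which chipmem can start the next write
    stalled = 0
    for _ in range(n_writes):
        now += work_cycles                  # work overlaps with the draining buffer
        while pending and pending[0] <= now:
            pending.popleft()               # these writes have reached chipmem
        if len(pending) == buffer_depth:    # buffer full: wait for the oldest entry
            stalled += pending[0] - now
            now = pending.popleft()
        start = max(now, bus_free_at)       # chipmem handles one write at a time
        bus_free_at = start + drain_cycles
        pending.append(bus_free_at)
    return now, stalled


if __name__ == "__main__":
    # Hypothetical numbers: plenty of work per write, then almost none.
    print(simulate_chip_writes(8, work_cycles=10, drain_cycles=4))
    print(simulate_chip_writes(6, work_cycles=1, drain_cycles=10))
```

In this model the CPU never stalls as long as the work between two writes takes at least as long as chipmem needs per longword. Once the buffer is full, each write costs one full drain time no matter how little work sits between the writes.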
Blueberry
Member
#5 - Posted: 1 Apr 2007 12:06
Kalms wrote:

fadd fp0,fp1 ; pOEP ; FPU op cycle 1
move.l d2,d3 ; sOEP
move.l d0,d1 ; pOEP ; FPU op cycle 2
move.l d2,d3 ; sOEP
move.l d0,d1 ; pOEP ; FPU op cycle 3
move.l d2,d3 ; sOEP

The sOEP is not available on the first cycle of an FPU instruction. The instructions are classified as pOEP-but-allows-sOEP, but it really means something different - that both the pOEP and the sOEP are available after the first cycle of execution.
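
The difference between the two readings can be counted with a small Python sketch (a model, not 060 code). It assumes the 3-cycle fadd/fsub/fmul figure quoted earlier in the thread, single-cycle integer instructions that always pair, and no memory stalls; the function names are made up for illustration.

```python
def integer_slots(fpu_cycles, soep_free_on_first_cycle):
    """Integer issue slots available while one FPU instruction executes."""
    # On cycle 1 the FPU instruction itself sits in the pOEP; the sOEP is
    # usable only under the "pairs on the first cycle" reading.
    first = 1 if soep_free_on_first_cycle else 0
    # On every later cycle both the pOEP and the sOEP are available.
    rest = 2 * (fpu_cycles - 1)
    return first + rest


def instructions_per_cycle(fpu_cycles, soep_free_on_first_cycle):
    """Throughput of one FPU op plus as many integer ops as fit beside it."""
    total = 1 + integer_slots(fpu_cycles, soep_free_on_first_cycle)
    return total / fpu_cycles


if __name__ == "__main__":
    print(instructions_per_cycle(3, True))    # 6 instructions in 3 cycles
    print(instructions_per_cycle(3, False))   # 5 instructions in 3 cycles
```

So with the sOEP blocked on the first cycle, the sequence above fits five instructions into the three cycles instead of six: the first move.l d2,d3 has to wait for cycle 2.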

Anyway, the cycle counts given are the execution times when FPU pipelining is disabled. When FPU pipelining is enabled, the next floating point instruction can be started one cycle earlier (i.e. after two cycles for fmul). You have to be careful not to use the same register in the two instructions, though, since the result will then be undefined. That is probably why pipelining is disabled by default.
Kalms
Member
#6 - Posted: 1 Apr 2007 23:30
Blueberry wrote: "The sOEP is not available on the first cycle of an FPU instruction. The instructions are classified as pOEP-but-allows-sOEP, but it really means something different - that both the pOEP and the sOEP are available after the first cycle of execution."

I see. 68060UM leaves both interpretations open. My 1200 isn't connected right now, but I'll take your word on that.

Blueberry wrote: "Anyway, the cycle counts given are the execution times when FPU pipelining is disabled. When FPU pipelining is enabled, the next floating point instruction can be started one cycle earlier (i.e. after two cycles for fmul). You have to be careful not to use the same register in the two instructions, though, since the result will then be undefined. That is probably why pipelining is disabled by default."

This, on the other hand, is something I've never heard of. Are you sure you are thinking of the 060 here? If so, which bit is it that controls FPU pipelining?
Kalms
Member
#7 - Posted: 3 Apr 2007 02:02
... or was that last bit (about FPU pipelining) an April Fools' joke...? :)
Blueberry
Member
#8 - Posted: 3 Apr 2007 11:40
Kalms wrote: "... or was that last bit (about FPU pipelining) an April Fools' joke...? :)"

Indeed it was. :) No free lunch for hopeful 060 FPU coders here...
Kalms
Member
#9 - Posted: 4 Apr 2007 02:49
Bah. I hope you're not getting any Easter candy this year. :)
Blueberry
Member
#10 - Posted: 4 Apr 2007 11:27
Too late. ;)

 
