|
Author |
Message |
_Jamie_
Member |
Hi,
How many cycles do these instructions take on the 68060?
add.l d0,d1
add.l d2,(a0)
thanks
|
Blueberry
Member |
The sequence takes a total of one cycle (assuming the memory access is longword-aligned and hits in the data cache).
This is based on the following observations:
- The add instruction can go in both pipelines (pOEP/sOEP),
- both instructions use simple addressing modes (at most one extension word),
- there are no register dependencies between the instructions,
- there is only one memory access in total (data cache supports one read or write access per cycle).
If the memory access misses the cache, things get more complicated. A fresh cache line needs to be read in from memory, and possibly a dirty one needs to be written back. Lots of cycles...
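The four observations above can be written down as a small checklist. This is only a sketch in C: the `Insn` struct, its field names and `may_dual_issue` are made up for illustration, and the real 68060 pairing rules in Chapter 10 of the user's manual have more cases than this.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified instruction description (not a real API). */
typedef struct {
    bool     soep_capable;  /* instruction class may issue in the sOEP */
    int      ext_words;     /* extension words used by its addressing mode */
    unsigned regs_read;     /* bit n = Dn, bit 8+n = An */
    unsigned regs_written;  /* same encoding */
    int      mem_accesses;  /* data cache accesses, counted as in the post */
} Insn;

/* True if `second` passes the four checks from the post against `first`. */
static bool may_dual_issue(const Insn *first, const Insn *second)
{
    if (!second->soep_capable)
        return false;                       /* must be allowed in the sOEP */
    if (first->ext_words > 1 || second->ext_words > 1)
        return false;                       /* simple addressing modes only */
    if (first->regs_written & second->regs_read)
        return false;                       /* second must not read first's result */
    if (first->mem_accesses + second->mem_accesses > 1)
        return false;                       /* one data cache access per cycle */
    return true;
}
```

For `add.l d0,d1` followed by `add.l d2,(a0)` all four checks pass, which matches the one-cycle answer for the cache-hit case.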
|
_Jamie_
Member |
I was looking for the answer in the cache-hit case. :) One cycle is really nice for what I want to do. :)
|
_Jamie_
Member |
How many cycles for this?
add.l d0,0(a0)
add.l d1,4(a0)
move.w 0(a0),d2
move.b 4(a0),d2
I will transfer my Amiga really soon, but at the moment I can't run all these tests myself. :(
|
Kalms
Member |
This is my guess:
add.l d0,0(a0) ; pOEP, sOEP, pOEP
; sOEP idle
add.l d1,4(a0) ; pOEP, sOEP, pOEP
; sOEP idle
move.w 0(a0),d2 ; pOEP
; sOEP idle
move.b 4(a0),d2 ; pOEP
; sOEP idle
The code sequence is memory-access limited. 6 memory accesses (4 reads + 2 writes) => 6 cycles to execute.
|
_Jamie_
Member |
Hi Kalms,
That's what I thought. :( I found a 3.5-cycle mapping loop (16 and 24 bits of precision), but the offset is a word and I would prefer a long, so I am looking for a better solution.
|
Blueberry
Member |
Something like this?
UBITS = 8
move.l d0,d5
move.b d7,(a1)+
move.w d1,d5
add.l a0,d0
lsr.l #16-UBITS,d5
add.l a1,d1
move.b (a2,d6.l),d7
move.l d0,d6
move.b d7,(a1)+
move.w d1,d6
add.l a0,d0
lsr.l #16-UBITS,d6
add.l a1,d1
move.b (a2,d5.l),d7
You can use any size in both U and V direction, though the total number of U bits is 16 in any case. U wraps but V doesn't. If you want V to wrap as well, you have to insert a masking instruction of some kind, bringing it up to 4 cycles (though the difference will probably drown in texture cache misses).
You can also use .w indexing with this version with 8 bits for U and V and get the wrapping for free. The distance from the calculation of the index to the use is long enough to avoid the 3-cycle stall for .w indexing.
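To make the index construction concrete, here is a minimal C sketch of what the `move.l` / `move.w` / `lsr.l` triple computes. It assumes my reading of the listing: V is kept as 16.16 fixed point in `d0`, and U sits in the low word of `d1` with `UBITS` integer bits. The names `texel_index`, `v` and `u` are mine, not from the listing.

```c
#include <stdint.h>

#define UBITS 8  /* same constant as in the listing */

/* v: V coordinate, 16.16 fixed point (integer part in the high word).
 * u: U coordinate in the low word, UBITS integer bits on top of
 *    (16 - UBITS) fraction bits; anything above bit 15 is ignored,
 *    which is why U wraps. */
static uint32_t texel_index(uint32_t v, uint32_t u)
{
    /* move.l d0,d5 ; move.w d1,d5 : high word from V, low word from U */
    uint32_t merged = (v & 0xFFFF0000u) | (u & 0x0000FFFFu);
    /* lsr.l #16-UBITS,d5 : drop the U fraction bits */
    return merged >> (16 - UBITS);
}
```

With V = 3 and U = 5.25 (`v = 0x00030000`, `u = 0x0540`) this gives `0x305`, i.e. row 3, column 5 of a 256-texel-wide texture. The V fraction in the low word of `v` never reaches the index, and V is not masked, so it does not wrap.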
Are you working on some new, fancy texture mapper? ;)
|
_Jamie_
Member |
Hi blueberry,
I really want more precision for my texture mapper. I use this for the moment, and it seems impossible to find anything better:
;----------------------------------------------------------------------
;pixel innerloop
;----------------------------------------------------------------------
;U1 P1 P2 P3
;V1 P1 P2 p3
;----------------------------------------------------------------------
;U1 P1 P2 V1
;P1 P2 p3 xx
;----------------------------------------------------------------------
;d0,d1,d2,d3
;a0,a1,a6,a7
;----------------------------------------------------------------------
loopPixel:
move.l d0,d2 ;p1
add.l a7,a6 ;s1
addx.l d1,d0 ;2
move.b (a0,d3.w),(a1)+ ;3
rol.l #8,d2 ;p4
move.l d0,d3 ;s4
add.l a7,a6 ;p5
rol.l #8,d3 ;s5
addx.l d1,d0 ;6
move.b (a0,d2.w),(a1)+ ;7
I am porting my Wii and DS engine to the Amiga. I tested my code on WinUAE, but as you know it is really not good for timing.
|
_Jamie_
Member |
For the cache you can use a swizzled texture, but in my case I prefer to have no overdraw, and without overdraw a swizzled texture is not really useful.
|
Blueberry
Member |
Hmm, some comments on the code:
- I assume you mean add.l d7,d6 rather than add.l a7,a6 ? Adding to an address register does not set the X flag needed for the addx.
- There are only two cycles of gap between d2 being modified (cycle 4) and it being used as a word index register (cycle 7). Since word indexing needs 3 cycles of gap, the last move will stall for one cycle.
- You have put up the memory-to-memory moves as taking one cycle, using both pOEP and sOEP. According to Chapter 10, memory-to-memory moves like these take 2 cycles (Table 10-6) and are classified as pOEP-until-last (Table 10-2), meaning another instruction can be dispatched in the sOEP (but not in the pOEP) on the second cycle.
However, some timing experiments I did on my real Amiga yesterday indicate that these moves actually only take one cycle and allow another instruction in the sOEP on the same cycle, as long as there are no memory accesses for a few cycles afterwards. Can anyone confirm this? I cannot do any more testing at the moment, since my Amiga is not working today. :(
If this last observation is correct, it should actually be possible to reduce the mapping loop to 3 cycles per pixel. :-D
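The stall arithmetic from the second comment, written out as a small C sketch. The function names are hypothetical, and the required gap of 3 cycles for word indexing is simply the figure quoted above, not something the code derives.

```c
/* Number of full cycles between the cycle that modifies a register
 * and the cycle that uses it as an index register. */
static int gap_cycles(int modify_cycle, int use_cycle)
{
    return use_cycle - modify_cycle - 1;
}

/* Stall cycles if the use comes sooner than the required gap. */
static int index_stall(int modify_cycle, int use_cycle, int required_gap)
{
    int missing = required_gap - gap_cycles(modify_cycle, use_cycle);
    return missing > 0 ? missing : 0;
}
```

`index_stall(4, 7, 3)` gives 1, the one-cycle stall on the last move. Moving the use to cycle 8, or the modification to cycle 3, brings it down to 0.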
|
Blueberry
Member |
Regarding swizzling the texture: Unless your textures are so small that they mostly fit in the cache, I think the extra cycle per pixel required for swizzled lookup will pay off with the better cache usage.
Why would the benefit of swizzling have anything to do with overdraw?
|
_Jamie_
Member |
- For the X bit: that was a mistake.
- I thought only 2 cycles of gap were needed.
- 3 cycles per pixel would be very nice.
- I use multi-size textures, so a 256*256 one will certainly not fit in the cache. I tested swizzled textures with a rotozoomer and with my old engine and gained absolutely nothing, maybe because I use the cache for the coverage buffer.
|
_Jamie_
Member |
loopPixel: move.b (a0,d2.w),(a1)+ ;s1
move.l d0,d2 ;p1
add.l d7,d6 ;s2
rol.l #8,d2 ;p2
addx.l d1,d0 ;3
move.b (a0,d3.w),(a1)+ ;s4
move.l d0,d3 ;p4
add.l d7,d6 ;s5
rol.l #8,d3 ;p5
addx.l d1,d0 ;6
Is this the perfect pixel loop?
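For anyone following along without the register layout in their head, here is a rough C model of what one pixel step of this loop computes. It is a sketch under my reading of the layout comments posted earlier (`d0` = U1 P1 P2 V1, `d6` = V fraction in the upper bytes, `d1`/`d7` holding the matching deltas). The struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t d0;  /* U integer (8) | U fraction (16) | V integer (8) */
    uint32_t d6;  /* V fraction in the upper bits */
} TexPos;

/* add.l d7,d6 ; addx.l d1,d0
 * The carry out of the V fraction (X flag) is added into the V integer
 * byte at the bottom of d0 together with the packed U/V delta in d1. */
static void step_uv(TexPos *p, uint32_t d1, uint32_t d7)
{
    uint32_t old = p->d6;
    p->d6 += d7;
    uint32_t x = (p->d6 < old) ? 1u : 0u;  /* carry = X flag */
    p->d0 += d1 + x;
}

/* rol.l #8 on a copy of d0, then take the low word: V integer in the
 * high byte, U integer in the low byte. On the 68k the .w index is
 * sign-extended, which this model does not reproduce. */
static uint16_t texel_offset(uint32_t d0)
{
    uint32_t rotated = (d0 << 8) | (d0 >> 24);
    return (uint16_t)rotated;
}
```

Starting from `d0 = 0x12000034` the offset is `0x3412` (V = 0x34, U = 0x12). Since the V integer shares `d0` with the U fraction, a carry out of the V byte spills into the lowest U fraction bit; the model keeps that behaviour because it just mirrors the 32-bit adds.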
|
Toffeeman
Member |
When are you guys going to make a demo together?
You could call it "Coders Porn" :0)
|
Blueberry
Member |
Oh well... I managed to get my Amiga running for a few minutes to do some more tests. It seems I was mistaken about the memory-to-memory move - it is as specified in Chapter 10, after all. Some rounding in my timing code caused the cycle counts to be returned as one less than they actually were....
Anyway, that changes the picture a bit: it confirms that a memory-to-memory move wastes half a cycle, since only one sOEP instruction can execute during its 2-cycle execution. Thus, the mapping loop would have to look something like this:
move.b (a0,d2.w),d4 ;p1
move.l d0,d2 ;s1
move.b d4,(a1)+ ;p2
add.l d7,d6 ;s2
addx.l d1,d0 ;3
move.b (a0,d3.w),d4 ;p4
rol.l #8,d2 ;s4
move.b d4,(a1)+ ;p5
move.l d0,d3 ;s5
add.l d7,d6 ;p6
rol.l #8,d3 ;s6
addx.l d1,d0 ;7
giving 3.5 cycles per pixel (7 cycles for two pixels).
|