Simple example:
Let's assume that fetching a cacheline takes 16 cycles: 4 cycles for each of the four longwords in the cacheline.
The datacache is empty.
a0 = $10000000 - in fastmem
Now, a read such as
move.l (a0)+,d0
will trigger a cacheline fetch. This will occupy the bus for 16 cycles. The CPU pipelines will pause until the requested data is available, and that data becomes available once the 1st longword has been fetched. Thus, the CPU pipelines will pause for 4 cycles.
During the next 12 cycles, if you touch that cacheline again or cause another bus transaction, the CPU pipelines will pause until the entire cacheline has been read. See the 68060UM for details.
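For illustration, the second case could look like this; the write to $dff180 (COLOR00) is just a stand-in for any uncached access that needs the bus during the line fill:
move.l (a0)+,d0 ; cache miss: starts the 16-cycle cacheline fetch
move.w d0,$dff180 ; needs the bus, so it waits until the whole cacheline has been read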
So this loop is good (the calculations run while the rest of the cacheline is still being fetched):
move.l (a0)+,d0
calculations...
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
and this loop is also good (the tst.b touches the next cacheline, so its fetch starts early and overlaps with the calculations):
tst.b 16(a0)
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
calculations...
while this loop is a bit slower (the later moves touch the cacheline while it is still being fetched, so they stall instead of doing useful work):
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
calculations...
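As a rough sketch, the second pattern as an actual loop might look like this (the data label, the COUNT constant and the register choices are placeholders, and note that the last iteration's tst.b touches one cacheline past the end of the buffer):
 lea data,a0 ; a0 = start of the 16-byte aligned records
 move.w #COUNT-1,d7 ; process COUNT records
loop:
 tst.b 16(a0) ; touch the next cacheline so its fetch starts early
 move.l (a0)+,d0 ; read the current record out of the already-fetched line
 move.l (a0)+,d1
 move.l (a0)+,d2
 move.l (a0)+,d3
 ; calculations on d0-d3 go here, overlapping the next cacheline's fetch
 dbra d7,loop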
But generally, what makes the most difference is this: 16-byte align your datastructures and keep them as small as possible.
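For example, with a Devpac/vasm-style assembler the CNOP directive pads to the next 16-byte boundary, so a 16-byte record fits exactly one cacheline (the field layout below is made up for illustration):
 CNOP 0,16 ; pad to the next 16-byte boundary
entry:
 ds.l 1 ; pointer to the next entry (4 bytes)
 ds.w 2 ; x/y position (4 bytes)
 ds.l 1 ; flags (4 bytes)
 ds.l 1 ; userdata (4 bytes)
 ; 16 bytes in total, so each entry is loaded with a single cacheline fill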