A.D.A. Amiga Demoscene Archive

Demos Amiga Demoscene Archive Forum / Coding / memory stalls

 

Author Message
dalton
Member
#1 - Posted: 23 May 2008 12:47
Does anyone know the penalty (in cycles) for reading or writing data that's not in the cache? This is on a Blizzard 1260/50.
Kalms
Member
#2 - Posted: 23 May 2008 13:43 - Edited
On the Blizzard 1260/50, transferring a cache line to/from fastram takes approximately 20 cycles.

Let's assume that you are doing an access which does not straddle a 16-byte boundary.

A memory read/write (for example "move.b (a0),d0" or "move.b d0,(a0)") causes 0, 1 or 2 line transfers depending on the exact contents of the datacache at that moment.

- if the line you are attempting to access is already in the cache, the read/write goes directly against that cacheline, no fastram access involved
- if the line is not in the cache, it has to be read in from fastram; the CPU pipelines stall until this transfer is completed and then the pipelines continue
- if the newly read-in line replaces a dirty cacheline, that cacheline will (transparently) be written out to fastram afterward. This is done in parallel with CPU pipeline operation so you only notice this if you cause another cache miss before the writeback is complete.
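
The three cases above can be sketched as a toy model. This is only an illustration: the direct-mapped layout, `NUM_SLOTS` and the function name `access` are assumptions made for the sketch and do not describe how the real 68060 datacache is organized.

```python
LINE_SIZE = 16   # bytes per cache line (from the thread)
NUM_SLOTS = 4    # toy cache size; hypothetical, far smaller than a real cache


def access(cache, addr, is_write):
    """Simulate one access; return the number of line transfers it causes.

    `cache` is a dict mapping slot -> (line_number, dirty).
    """
    line = addr // LINE_SIZE
    slot = line % NUM_SLOTS
    entry = cache.get(slot)
    if entry is not None and entry[0] == line:
        # Hit: served from the cache line, no fastram access.
        cache[slot] = (line, entry[1] or is_write)
        return 0
    # Miss: the line has to be read in from fastram (one transfer).
    transfers = 1
    if entry is not None and entry[1]:
        # The replaced line was dirty, so it is written back (second transfer).
        transfers += 1
    cache[slot] = (line, is_write)
    return transfers
```

Starting from an empty `cache = {}`, a first read of a line returns 1, a second access to the same line returns 0, and a miss that replaces a line which was written to returns 2.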

Read chapter 5 in the 68060UM for details.
Kalms
Member
#3 - Posted: 23 May 2008 13:46
oh, wait, I forgot: when a read miss occurs the CPU will continue pipeline processing as soon as the requested data is available (that is, before the entire line has been fetched from fastram).

Beware of data accesses which cross a 16-byte boundary, they are expensive.
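
A quick way to check whether an access crosses a 16-byte boundary is to compare the line numbers of its first and last byte. A minimal sketch; the function name is hypothetical:

```python
LINE_SIZE = 16  # bytes per cache line (from the thread)


def crosses_line_boundary(addr, size):
    """Return True if an access of `size` bytes at `addr` spans two lines."""
    first_line = addr // LINE_SIZE
    last_line = (addr + size - 1) // LINE_SIZE
    return first_line != last_line
```

For example, a longword read at an address ending in $C stays inside one line, while the same read at an address ending in $E touches two.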

Again, 68060UM has the full story.
dalton
Member
#4 - Posted: 23 May 2008 14:20
Is there any way to make use of the wasted cycles? I've always assumed that program execution would continue during the line fetch, as long as there are no data dependencies.

If the pipelines are stopped, I suppose it would make sense to peek at the first word of the needed line some time in advance.
Kalms
Member
#5 - Posted: 23 May 2008 17:37 - Edited
Simple example:

Let's assume that fetching a cacheline takes 16 cycles, 4 cycles for each longword in the cacheline.

The datacache is empty

a0 = $10000000 - in fastmem

Now,

	move.l	(a0),d0


will trigger a cacheline fetch. This will occupy the bus for 16 cycles. The CPU pipelines will pause until the requested set of data is available. That set of data is available after the 1st longword has been fetched. Thus, the CPU pipelines will pause for 4 cycles.
During the next 12 cycles, if you touch that cacheline or cause a bus transaction, the CPU pipelines will pause until the entire cacheline has been read. See 68060UM for details.
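
The arithmetic of this example can be written down as a small sketch. It uses only the simplified numbers assumed above (16 cycles per line, 4 per longword, requested longword arrives first); the function names are hypothetical and real timings will differ.

```python
LINE_FETCH_CYCLES = 16  # whole cache line, as assumed in the example
LONGWORD_CYCLES = 4     # one longword of the line


def miss_stall():
    """Pipeline stall for the access that triggers the line fetch."""
    return LONGWORD_CYCLES


def touch_stall(cycles_since_resume):
    """Extra stall if the same line is touched (or another bus transaction
    is started) `cycles_since_resume` cycles after the pipeline resumed."""
    remaining = LINE_FETCH_CYCLES - LONGWORD_CYCLES - cycles_since_resume
    return max(0, remaining)
```

Touching the line immediately costs the full remaining 12 cycles, while waiting 12 cycles or more before the next touch costs nothing extra.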

So this loop is good:

	move.l	(a0)+,d0
	calculations...
	move.l	(a0)+,d0
	move.l	(a0)+,d0
	move.l	(a0)+,d0


and this loop is also good:

	tst.b	16(a0)
	move.l	(a0)+,d0
	move.l	(a0)+,d0
	move.l	(a0)+,d0
	move.l	(a0)+,d0
	calculations...


while this loop is a bit slower:

	move.l	(a0)+,d0
	move.l	(a0)+,d0
	move.l	(a0)+,d0
	move.l	(a0)+,d0
	calculations...
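
To see why the ordering matters, the three loops can be run through a toy timing model. This is a rough sketch under the simplified assumptions of this example (16 cycles per line fetch, data usable after 4 cycles, a touch of the line being fetched or a second miss waits for the fetch to finish). The 1-cycle cost per load, the cycle count given to "calculations", and all names are assumptions for illustration. Write-backs and everything else in the real CPU are ignored.

```python
LINE_SIZE = 16
LINE_FETCH_CYCLES = 16   # assumed in the example
FIRST_DATA_CYCLES = 4    # pipeline resumes after the first longword


def run(ops):
    """Return total cycles for ops: ("load", address) or ("calc", cycles)."""
    now = 0
    cached = set()    # lines present (or currently arriving)
    fill_line = None  # line currently being fetched
    fill_done = 0     # cycle at which that fetch completes
    for kind, arg in ops:
        if kind == "calc":
            now += arg
            continue
        line = arg // LINE_SIZE
        filling = now < fill_done
        if line in cached:
            if filling and line == fill_line:
                now = fill_done           # touched the line mid-fetch
        else:
            if filling:
                now = fill_done           # bus busy with the earlier fetch
            fill_line, fill_done = line, now + LINE_FETCH_CYCLES
            cached.add(line)
            now += FIRST_DATA_CYCLES      # wait for the requested longword
        now += 1                          # the load itself (assumed 1 cycle)
    return now


def loop_a(lines, calc):
    """First load, calculations, then the remaining three loads."""
    ops = []
    for base in range(0, lines * LINE_SIZE, LINE_SIZE):
        ops += [("load", base), ("calc", calc)]
        ops += [("load", base + off) for off in (4, 8, 12)]
    return ops


def loop_b(lines, calc):
    """Touch the next line first (tst.b 16(a0)), then load the current one."""
    ops = []
    for base in range(0, lines * LINE_SIZE, LINE_SIZE):
        ops.append(("load", base + 16))
        ops += [("load", base + off) for off in (0, 4, 8, 12)]
        ops.append(("calc", calc))
    return ops


def loop_c(lines, calc):
    """Four back-to-back loads, then calculations."""
    ops = []
    for base in range(0, lines * LINE_SIZE, LINE_SIZE):
        ops += [("load", base + off) for off in (0, 4, 8, 12)]
        ops.append(("calc", calc))
    return ops
```

With four lines and a 12-cycle calculation, `run(loop_a(4, 12))`, `run(loop_b(4, 12))` and `run(loop_c(4, 12))` give 80, 110 and 124 cycles in this model. The second loop pays a one-time cost on the first iteration because the current line is not cached yet; after that it runs at about the same rate as the first loop.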


But generally, what makes the most difference is: 16-byte align your datastructures and make sure that they are as small as possible.
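
A small sketch of what alignment and size mean in terms of cache lines. The helper names are hypothetical, and the record sizes in the note below are made-up examples.

```python
LINE_SIZE = 16  # bytes per cache line (from the thread)


def align_up(addr, alignment=LINE_SIZE):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def lines_spanned(base, size):
    """Number of cache lines covered by `size` bytes starting at `base`."""
    first = base // LINE_SIZE
    last = (base + size - 1) // LINE_SIZE
    return last - first + 1
```

A 16-byte structure at $10000008 spans two lines, while the same structure at `align_up(0x10000008)` spans one. An aligned array of 100 records covers 75 lines at 12 bytes per record and 100 lines at 16 bytes per record, so shrinking a structure directly reduces the number of line fetches.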
dalton
Member
#6 - Posted: 23 May 2008 19:16
Thanks a lot for those tips!

 
