Simple example:
Let's assume that fetching a cacheline takes 16 cycles: 4 cycles for each of the four longwords in the cacheline.
The datacache is empty.
a0 = $10000000 - in fastmem
Now, a read such as
move.l (a0)+,d0
will trigger a cacheline fetch. This will occupy the bus for 16 cycles. The CPU pipelines will pause until the requested data is available, and that data becomes available once the 1st longword has been fetched. Thus, the CPU pipelines will pause for 4 cycles.
During the next 12 cycles, if you touch that cacheline again or cause another bus transaction, the CPU pipelines will pause until the entire cacheline has been read. See the 68060UM for details.
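For illustration, the second case could look like this; the write to $dff180 (COLOR00) is just a stand-in for any uncached access that needs the bus during the line fill:
move.l (a0)+,d0 ; cache miss: starts the 16-cycle cacheline fetch
move.w d0,$dff180 ; needs the bus, so it waits until the whole cacheline has been read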
So this loop is good (the calculations run while the rest of the cacheline is still being fetched):
move.l (a0)+,d0
calculations...
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
and this loop is also good (the tst.b touches the next cacheline, so its fetch starts early and overlaps with the calculations):
tst.b 16(a0)
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
calculations...
while this loop is a bit slower (the later moves touch the cacheline while it is still being fetched, so they stall instead of doing useful work):
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
move.l (a0)+,d0
calculations...
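As a rough sketch, the second pattern as an actual loop might look like this (the data label, the COUNT constant and the register choices are placeholders, and note that the last iteration's tst.b touches one cacheline past the end of the buffer):
 lea data,a0 ; a0 = start of the 16-byte aligned records
 move.w #COUNT-1,d7 ; process COUNT records
loop:
 tst.b 16(a0) ; touch the next cacheline so its fetch starts early
 move.l (a0)+,d0 ; read the current record out of the already-fetched line
 move.l (a0)+,d1
 move.l (a0)+,d2
 move.l (a0)+,d3
 ; calculations on d0-d3 go here, overlapping the next cacheline's fetch
 dbra d7,loop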
But generally, what makes the most difference is this: 16-byte align your datastructures and keep them as small as possible.
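For example, with a Devpac/vasm-style assembler the CNOP directive pads to the next 16-byte boundary, so a 16-byte record fits exactly one cacheline (the field layout below is made up for illustration):
 CNOP 0,16 ; pad to the next 16-byte boundary
entry:
 ds.l 1 ; pointer to the next entry (4 bytes)
 ds.w 2 ; x/y position (4 bytes)
 ds.l 1 ; flags (4 bytes)
 ds.l 1 ; userdata (4 bytes)
 ; 16 bytes in total, so each entry is loaded with a single cacheline fill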