A.D.A. Amiga Demoscene Archive


Amiga Demoscene Archive Forum / Coding / movem

 

dalton
Member
#1 - Posted: 18 Feb 2011 09:16
I was thinking I would write generalized memcopy and memclear routines designed to perform well on 060. So I have some ideas, but I'm not sure if they're correct.

Movem of 16-byte chunks should be optimal, as a cache line is 16 bytes. But movem writes can't use post-increment, only pre-decrement, so does that mean that the longwords (inside the cache line) are written in reverse order? If that's the case, I suspect movem without pre-decrement (plain (a0) addressing plus an explicit add) might come out ahead.

So I have these two options for clearing memory:

Alternative 1, a0 is initialized to end of buffer

.loop	movem.l	d1-d4,-(a0)
	subq.l	#1,d7	;d7 = numbytes/16
	bgt.b	.loop


Alternative 2, a0 is initialized to beginning of buffer

.loop	movem.l	d1-d4,(a0)
	adda.l	d0,a0	;d0 = 16
	subq.l	#1,d7	;d7 = numbytes/16
	bgt.b	.loop


Which would be the faster of these two alternatives?

Also, would there be much to gain from fetching 32 or 48 bytes at a time instead of just 16? My gut feeling is that the performance gain should be negligible, but I really don't know.
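For reference, an unrolled variant clearing 48 bytes per iteration might look like this (untested sketch; register choice is arbitrary, d1-d4 are assumed pre-cleared, and the buffer size must be a multiple of 48):

```
	; a0 = end of buffer, d7 = numbytes/48
.loop	movem.l	d1-d4,-(a0)	;16 bytes
	movem.l	d1-d4,-(a0)	;16 bytes
	movem.l	d1-d4,-(a0)	;16 bytes
	subq.l	#1,d7
	bgt.b	.loop
```

The unrolling only saves the subq/bgt overhead per chunk, which is why I'm unsure it buys much against the memory stalls.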
Blueberry
Member
#2 - Posted: 18 Feb 2011 13:28 - Edited
If both your source and destination are 16-byte aligned, the fastest way to copy a large piece of memory is using the move16 instruction (present on 040+), i.e.

.loop	move16	(a0)+,(a1)+
	subq.l	#1,d7	;d7 = numbytes/16
	bgt.b	.loop

Writing with movem to an uncached line will trigger a read of that line, even if you write the whole line in the movem. Thus, you get two reads and one write per cache line. The move16 instruction bypasses the cache and accesses memory directly, so you only get one read and one write.

The source cache line of move16 will not be put into the cache. If the destination line is in the cache, it will be evicted. But if the source line is in the cache to begin with, it will be read from the cache and only a write operation will be performed.

Thus, the fastest way to clear a large piece of 16-byte aligned memory is to ensure that a particular line is in the cache (by writing to it, for instance), and then move that cache line repeatedly to each destination cache line using move16.
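A possible sketch of that clearing scheme (untested; assumes a hypothetical 16-byte-aligned, zero-filled line at label zeroline, a1 = 16-byte-aligned destination, and uses the absolute-long source form of move16):

```
	lea	zeroline,a0
	clr.l	(a0)		;write to the line so it lands in the data cache
.loop	move16	(zeroline).l,(a1)+ ;copy the cached zero line to each dest line
	subq.l	#1,d7		;d7 = numbytes/16
	bgt.b	.loop
```

The absolute-long form is used here so the source address stays fixed; the usual (a0)+,(a1)+ form would walk the source pointer forward as well.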

In my tests (Blizzard 1260 50MHz), move16 takes 19 cycles from uncached memory, or 11 cycles from cached memory.
dalton
Member
#3 - Posted: 18 Feb 2011 17:16
Ah, that's pretty cool. So basically I can use move16 to clear a large buffer, and data cache will be intact afterwards? What happens if there are pending chip writes in the data cache? Will that stall the execution of move16 or does it bypass?
Blueberry
Member
#4 - Posted: 18 Feb 2011 23:56
I would expect move16 to wait for all pending memory accesses to complete, but I haven't tested it (or I don't remember the result if I have).
Blueberry
Member
#5 - Posted: 18 Feb 2011 23:59
And yes, move16 will leave the data cache contents intact, apart from the cache line written by the instruction.
Blueberry
Member
#6 - Posted: 10 Mar 2011 16:33
Hmm, it seems I have optimistic memory (my own, that is). The move16 cycle counts are exactly double the numbers I wrote. Specifically (on my A1200 - YMMV):

Move16 from uncached memory takes 38 cycles. The first cycle can overlap with the last cycle of a floating point arithmetic instruction, but apart from that, it doesn't seem to permit overlap with anything.

Move16 from cached memory takes 22 cycles. The last two cycles can overlap with anything that does not access memory or the cache.

Some further test results, while we're at it:

A read from an uncached cache line which does not cause a dirty cache line to be evicted causes a stall of 8 cycles before the instruction. During the following 11 cycles, no instruction (apart from the original read) can access memory or the cache (not even the same cache line).

A write to an uncached cache line which causes a dirty cache line to be evicted causes a stall of 16 cycles before the instruction. During the following 4 cycles, no instruction (apart from the original write) can access memory or the cache (not even the same cache line). During the next 11 cycles, instructions can access the cache, but a cache miss causes a stall until the push buffer (writing the dirty cache line) is empty.

Thus, some lower bounds (when not using move16):

Reading a large piece of memory: 19 cycles per cache line
Writing a large piece of memory: 31 cycles per cache line
Reading and writing the same large piece of memory: 32 cycles per cache line
Reading and writing different large pieces of memory: 50 cycles per cache line
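At 50 MHz and 16 bytes per cache line, those bounds work out to roughly:

```
reading:            50 000 000 / 19 * 16 bytes  ~ 42 MB/s
writing:            50 000 000 / 31 * 16 bytes  ~ 26 MB/s
copying (r+w diff): 50 000 000 / 50 * 16 bytes  = 16 MB/s
```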

Some conclusions:

- Fast memory is not so fast after all (though still many times faster than chip).
- There is a lot of potential for combining memory accesses with computations, as long as the computations do not need to access the cache.
- When writing to uncached memory, it pays off to do a read of each cache line some time before the write, in order to get the 8 cycle stall on the read rather than the 16 cycle stall on the write.
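The pre-read trick from the last point might be sketched like this (untested; tst.l touches the line one iteration ahead of the write, so the first line should also be touched before entering the loop):

```
	tst.l	(a0)		;touch the first line before the loop
.loop	tst.l	16(a0)		;touch the next line: pay the 8-cycle read
				;stall here instead of 16 cycles on the write
	movem.l	d1-d4,(a0)	;write the current, now-cached line
	adda.w	#16,a0
	subq.l	#1,d7		;d7 = numbytes/16
	bgt.b	.loop
```

Touching only one line ahead may still overlap poorly with the write; touching further ahead ("some time before the write", as above) should help more.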
dalton
Member
#7 - Posted: 14 Mar 2011 20:46
that's really interesting stuff!
