Author |
Message |
Reloaded
Member |
Hello. I have a question: what is the fastest routine to do c2p on the 68060, and only in fast memory (from fast to fast)?
Thanks.
|
_Jamie_
Member |
Hi Reloaded,
My own version ran in 0.23 VBL, and I'm sure it was not the fastest possible.
|
Reloaded
Member |
Hello _Jamie_. Is your c2p routine 'public'? Heh heh heh.
|
_Jamie_
Member |
No, it isn't, but I'm sure you can find a public c2p and just make it write to fast RAM; it will give more or less the same result.
|
Azure
Member |
On the 060 there are some tricks to speed up C2Ps by using "rot-merges", which rely on the fact that the barrel shifter on the 68060 is extremely fast. IIRC this allows you to speed up merges by one cycle or so. Maybe I can still find my old sources. This optimization is of no use if you render to chipmem, since there is plenty of latency during writes, but it may help for fastmem.
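To make the merge step being discussed concrete, here is a minimal C sketch of the generic shift-and-mask merge (a "delta swap") that c2p routines chain together. The 060 rot-merge variant replaces the two shifts with rotates (ror.l/rol.l), which the 060's fast barrel shifter makes very cheap; the function name and constants below are illustrative, not taken from Azure's source.

```c
#include <stdint.h>

/* One c2p "merge" step: swap the bits of *b selected by `mask`
   with the bits of *a located `shift` positions higher.  c2p
   routines chain such steps with shift/mask pairs like
   16/0x0000FFFF, 8/0x00FF00FF, 4/0x0F0F0F0F, and so on.
   On the 68060, the rot-merge trick replaces the shifts with
   rotates, which its barrel shifter executes in a single cycle. */
static void merge(uint32_t *a, uint32_t *b, int shift, uint32_t mask)
{
    uint32_t t = ((*a >> shift) ^ *b) & mask;
    *b ^= t;
    *a ^= t << shift;
}
```

A full 8-plane c2p chains several such passes (roughly one per power-of-two shift) over a register set holding 32 chunky pixels.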
|
Reloaded
Member |
Thanks for your comments, Azure. If you find your c2p sources for 68060, it would be of great help.
|
Kalms
Member |
reloaded, azure:
Download amycoding.redline.ru/main/sources/kalmsc2p.lha. Inside that archive, the source named c2p/others/cpu5azure2.asm uses the rot-merge approach. Perhaps that one will do?
|
Blueberry
Member |
A c2p that was written for converting from fastmem to chipmem will not get you anywhere near the optimal fastmem-to-fastmem conversion if you just use it blindly. The cache considerations are completely different.
When converting from fast to fast, it is important that the planar data is written (almost) on top of the chunky data to utilize the cache optimally. In this case, the memory is only read once and written once. If you write it to a different place, the old contents of the destination will also be read into the cache along the way, resulting in twice as much memory read. And fast mem is still slow enough for this to matter.
In order to achieve this overlap, the bitplanes need to be interleaved, i.e. first row from all bitplanes, then second row from all bitplanes and so on. This way, each 320-byte line of chunky data will be written to a contiguous block of planar data, albeit in a different order.
In order not to overwrite data before you need it, the planar data needs to start slightly before the chunky data. For a 320-pixel-wide screen, having the bitplanes start 256 bytes before the chunky data is sufficient.
There is even more to be gained on top of this by prefetching chunky data ahead of the use, but the technique described here is what makes the greatest difference.
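The overlap argument above can be sanity-checked with a small model. The sketch below is plain C with assumed parameters (320x200x8 screen, 32-pixel blocks converted per merge unit; this is a model, not Blueberry's actual routine): it verifies that, with interleaved bitplanes and the planar buffer starting `ofs` bytes before the chunky buffer, no chunky byte is overwritten before it has been read.

```c
/* Model of the in-place fast->fast conversion: each block of 32
   chunky pixels (32 bytes) is read, then its planar output -- 4
   bytes into each of the 8 plane rows of one interleaved line --
   is written `ofs` bytes before the chunky data.  Returns 1 if no
   chunky byte is ever clobbered before being read. */
#define W        320
#define H        200
#define PLANES   8
#define ROWBYTES (W / 8)                  /* 40 bytes per plane row */

static int overlap_safe(long ofs)
{
    for (long y = 0; y < H; y++) {
        for (long blk = 0; blk < W / 32; blk++) {
            /* chunky bytes [0, frontier) have been read so far */
            long frontier = y * W + blk * 32 + 32;
            for (long p = 0; p < PLANES; p++) {
                long w = y * W - ofs + p * ROWBYTES + blk * 4;
                if (w >= 0 && w + 4 > frontier)
                    return 0;             /* unread data would be lost */
            }
        }
    }
    return 1;
}
```

In this model `overlap_safe(256)` holds while `overlap_safe(0)` does not, matching the 256-byte figure above.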
|
Azure
Member |
kalms: nice find, I did not remember having spread my version of that c2p around :) Unfortunately it does not appear to be in the archive, but that may be an issue with my version of LhA.
blueberry: well, I guess you have to make sure that each bitline occupies a different cache line. You would not necessarily need interleaved bitplanes, but you have to make sure to use the right bitplane offsets.
|
Azure
Member |
Btw, I remembered another c2p optimization I never really tried. The key to an efficient c2p is high chipmem bandwidth, which you get when you write to chipmem while only 4 or fewer bitplanes are active. Of course, usually you are using 8 bitplanes, hence you get a lot of additional wait states - basically the chipmem bandwidth for the CPU drops from 7.x MB/s to 4 MB/s. A trick to work around this is to write to chipmem only during the VBL. Since it is not possible to write an entire 320x200 (or x256) screen during one VBL, the conversion has to be split up across several frames.
A working solution could for example be to lock your engine to 50/3 fps and spread the chipmem writes across 3 frames. If you have a steady framerate it will still look smooth.
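The split-across-frames idea boils down to a trivial state machine: copy one slice of the converted planar buffer per vertical blank. A minimal sketch (buffer sizes assumed for 320x200x8; memcpy stands in for the actual chipmem copy loop):

```c
#include <stddef.h>
#include <string.h>

#define SCREEN_BYTES (320 * 200)  /* 8 planes of 320x200 = 64000 planar bytes */
#define SPLIT 3                   /* spread the copy across 3 frames (50/3 fps) */

/* Copy one slice of the converted planar buffer per vertical blank.
   Returns 1 once the whole frame has been pushed to chip RAM. */
static int copy_slice(unsigned char *chip, const unsigned char *fast, int *phase)
{
    size_t chunk = (SCREEN_BYTES + SPLIT - 1) / SPLIT;
    size_t off   = (size_t)*phase * chunk;
    size_t len   = off + chunk > SCREEN_BYTES ? SCREEN_BYTES - off : chunk;
    memcpy(chip + off, fast + off, len);
    *phase = (*phase + 1) % SPLIT;
    return *phase == 0;
}
```

Calling this once per VBL interrupt finishes a full screen every third frame, which matches the 50/3 fps lock described above.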
|
noname
Member |
azure: don't know much about c2p, but isn't that very similar to what sp advocated in one of his Amycoders tutorials back then? iirc he set up a screen with big black borders, e.g. 320x180, and only did c2p when the DMA wasn't busy displaying the bitplanes.
|
Kalms
Member |
azure: it's in there... try with a different version of LhA (or Total Commander) :)
Btw, I'm pretty sure that Jamie does that (c2p fast->fast, then copy fast->chip during VBL) in the engine for his latest demos. I considered doing that on occasion but was too lazy to specialize my code enough to have fast->chip copying interleaved into the processing of other stuff...
|
_Jamie_
Member |
I convert while the bitplanes are inactive; I need only 2 VBLs to convert 320*200. In the end I don't interleave the fast-to-chip copy with other operations, because I'm lazy too and it's not really good for the cache except for some specific tasks (e.g. metaballs).
Btw, another way to save machine time is to use a real triple buffer.
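A real triple buffer, as mentioned, comes down to rotating three buffer indices so the renderer never has to wait for the vertical blank; a minimal sketch (struct and function names are illustrative, not from Jamie's engine):

```c
/* Three buffers: one being displayed, one queued for display at the
   next vertical blank, one being rendered into.  The CPU always has
   a free buffer, so it never stalls waiting for the beam. */
typedef struct { int draw, queued, display; } TripleBuf;

static void frame_done(TripleBuf *t)   /* CPU finished rendering a frame */
{
    int tmp = t->draw; t->draw = t->queued; t->queued = tmp;
}

static void vblank(TripleBuf *t)       /* raster reached the vertical blank */
{
    int tmp = t->display; t->display = t->queued; t->queued = tmp;
}
```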
|
Azure
Member |
kalms: I checked with a hex editor, there is no file named cpu5azure2.asm in the archive. Maybe you have a different archive?
noname: Yeah, it's possible he mentioned that. It's not exactly a novel idea.
|
Blueberry
Member |
Azure: Cache line alignment does not matter much when you are reading and writing a large, contiguous area of memory. Different alignments will cause different instructions to miss the cache, which can give a small difference, but this can be avoided by inserting explicit prefetch instructions which will take the cache miss no matter what the alignment is.
Even then, it is probably a good idea to make sure the alignment is always the same, so that the performance is predictable. Since executable file sections and memory allocations are only 8-byte aligned this involves allocating a larger area and rounding the address up to align it.
You do need interleaved bitplanes in order to read and write to the same memory area. Otherwise you will overwrite chunky data further down the screen before you get to read it.
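The allocate-larger-and-round-up trick reads like this in C (malloc stands in for AllocMem here, and 16-byte 68060 cache lines are assumed):

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 16   /* 68060 cache line size in bytes */

/* Memory allocations are only guaranteed 8-byte alignment, so to get
   predictable cache-line alignment, allocate extra and round the
   pointer up.  The original (unaligned) pointer is returned via *raw
   so it can be freed later. */
static void *alloc_aligned(size_t size, void **raw)
{
    *raw = malloc(size + CACHE_LINE - 1);
    if (!*raw)
        return NULL;
    uintptr_t p = (uintptr_t)*raw;
    return (void *)((p + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1));
}
```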
|
Kalms
Member |
azure: argh, that was an old version of the archive. sorry about that. the archive over at http://www.modermodemet.se/dalton/src/c2p/kalmsc2p.lha does include the file (I just checked).
|
Azure
Member |
kalms: Thanks! That looks like the right one. Good thing the internet never forgets - it's much safer than having your own backups :)
blue: Ok, I was not aware you wanted to write to the same buffer. In that case it makes sense.
|
Reloaded
Member |
Kalms, thanks for the archive with the azure source. And thanks to azure too.
|
sp_
Member |
It's possible to optimize Azure's c2p further by using interleaved bitplanes or a screen smaller than 320x234.
Then you can remove 8 instructions per loop (the add.l #.plane,a1 ones)
and use move.l dx,xxxx(a1) instead.
|