|
Author |
Message |
johnsa
Member |
Hey,
I've read some links and various bits about C2P routines from Azure, Kalms etc but I thought it would be fun and more interesting to actually code one from scratch.
I'm curious what the maximum performance would be. I've tested mine on an 030/50MHz A1200 and on WinUAE.
On the A1200 030/50MHz the 1x1 320x256x8 C2P runs at 25fps, CPU only. The code is 060 optimised, so not ideal for the 030, but how does 25fps fare against the best known routines? I.e. what is the theoretical limit for C2P'ing a 320x256x8 screen on 030/50, 060/50+ etc.?
Any info would be great, Cheers
|
todi
Member |
The theoretical limit is the time it takes to longword-copy 320*256 bytes from fast memory to chip memory. (Some tricks, like copying from fast to chip while DMA is off, speed things up.) For 060/50 the time it takes to copy a 320x256x8 image from fast to chip is approx. 220 rasterlines, so if your chunky routine takes less than 68 rasterlines you would get 50 fps. Have you looked at these articles?
https://amycoders.org/sources/writepipe.html
https://amycoders.org/opt/fasttruec2p.html
https://amycoders.org/sources/c2ptut.html
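To make the rasterline budget above concrete, here is a rough Python sketch. It assumes a PAL frame of 312 rasterlines at 50Hz, that the C2P and the copy run back to back with no overlap, and that the display only updates on a frame boundary. The name `fps_for_budget` and its parameters are made up for illustration, and effect rendering time is not included.

```python
import math

PAL_LINES_PER_FRAME = 312    # assumption: PAL, non-interlaced
PAL_FRAMES_PER_SECOND = 50   # assumption: 50Hz refresh


def fps_for_budget(copy_lines, c2p_lines, lines_per_frame=PAL_LINES_PER_FRAME):
    """Frame rate when C2P and copy run serially and output is vblank-locked."""
    total_lines = copy_lines + c2p_lines
    # Work that spills past one frame costs a whole extra frame.
    frames_needed = max(1, math.ceil(total_lines / lines_per_frame))
    return PAL_FRAMES_PER_SECOND / frames_needed


print(fps_for_budget(220, 68))    # fits in one frame
print(fps_for_budget(220, 100))   # spills into a second frame
```

Under these assumptions, anything that pushes the total past one frame halves the rate, which is why the chunky routine's line count matters so much.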
|
johnsa
Member |
Hi,
Thanks for the info. Yep those are the articles I'd read through initially to get the basic ideas. I then went off and tried to re-create it myself without "borrowing" code, just so I had a good understanding of the merge operations and how to potentially optimise it.
I've tried to stay clear of raster-count based timing and instead used the CIA timer to measure µs and the resulting frame times.
I read about the bandwidth variations from fast->chip and used the bustest tool to verify. It seems that when DMA is off you can get 7MB/s from fast->chip.
That should equate to about 89 copies per second of 320x256 (81,920 bytes) under optimal conditions?
So my idea was that a theoretical max, assuming no stolen cycles on the bus, no CPU work and no C2P, should be roughly 89fps. From there you can start deducting for CPU interaction and the C2P algorithm itself.
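As a quick sanity check of that arithmetic, a small Python sketch. The helper name `copies_per_second` is made up for illustration. Whether the bustest figure is in units of 1024*1024 or 1,000,000 bytes is an assumption, so both are shown; the 89 figure matches the first.

```python
def copies_per_second(bytes_per_second, width=320, height=256):
    """Upper bound on full-screen 8-bit copies per second at a given bandwidth."""
    frame_bytes = width * height  # one byte per chunky pixel
    return bytes_per_second / frame_bytes


MIB = 1024 * 1024

print(320 * 256)                       # 81920 bytes per frame
print(copies_per_second(7 * MIB))      # 89.6 if "7MB/s" means 7 * 1024 * 1024
print(copies_per_second(7_000_000))    # 85.44921875 if it means 7 * 10**6
```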
So my approach in terms of the C2P algo is nothing new, I don't think there are any mysteries left to discover there. I've interleaved phases to try and get the best instruction pairing, blocked the merges, and I process 32 pixels per iteration with some attempt to help the cache prefetch etc. The only slightly different approach I might have taken, although I've read about it in other places, was to use interleaved bitplanes. I do the C2P from fast->fast with a -256 offset to try to keep the read location and write destination close for the cache. I then wait for VBL and copy the buffer with an unrolled loop as quickly as possible (it's already in the right interleaved planar form). I seem to be able to do the copy entirely during the vblank, which I wasn't too convinced about given others' measurements of 220 raster lines, since the display period is usually longer than the blanking.
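For anyone following along, the merge operations mentioned above can be sketched in Python on a toy scale. This is not the routine described in the post: it handles only 8 pixels held as 8 separate bytes, whereas a real 68k routine works on 32-bit registers and 32 pixels per pass. The function names are made up for illustration. The naive version defines the expected result and the merge version reaches it with three swap passes, which amounts to an 8x8 bit-matrix transpose.

```python
def c2p_naive(pixels):
    """Reference: 8 chunky pixels -> 8 plane bytes (list index = bitplane number)."""
    planes = []
    for bit in range(8):
        byte = 0
        for pixel in pixels:
            # Leftmost pixel ends up in bit 7 of the plane byte.
            byte = (byte << 1) | ((pixel >> bit) & 1)
        planes.append(byte)
    return planes


def merge(a, b, shift, mask):
    """Swap the bits of a selected by mask with the bits of b selected by mask << shift."""
    t = (a ^ (b >> shift)) & mask
    return a ^ t, b ^ (t << shift)


def c2p_merge(pixels):
    """Same result as c2p_naive, via three merge passes (4-bit, 2-bit, 1-bit swaps)."""
    rows = list(pixels)
    for shift, mask in ((4, 0x0F), (2, 0x33), (1, 0x55)):
        for k in range(8):
            if k & shift == 0:  # pair row k with row k + shift
                rows[k], rows[k + shift] = merge(rows[k], rows[k + shift], shift, mask)
    # Transposed row r holds pixel bit 7 - r, so reverse to get plane order 0..7.
    return rows[::-1]


sample = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]
print(c2p_merge(sample) == c2p_naive(sample))
```

The naive loop touches every bit individually, while the merge version needs only 12 swap operations for the same 64 bits, which is the whole attraction of the technique.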
This is giving me 25fps on 030/50 and 48fps on WinUAE 060/no-JIT/full-DMA etc. (although I wouldn't trust WinUAE for timings). This is with drawing something into the chunky buffer one pixel at a time.
If I take that out so it's just a pure C2P/copy then I get 27fps for 030/50 (real h/w) and 69fps for WinUAE/no-JIT/full-DMA/cycle-exact, which seems like the limit for me.
|
Blueberry
Member |
The chip memory bandwidth varies between different 060 cards. On most fast cards (and, I think, all A4000 cards) you get the 7MB/s. But on some 50MHz A1200 cards you only get around 5MB/s. I have heard a rumor that you can get the high speed by overclocking the card to just 55MHz, but I haven't tried it. :)
On 060, it's a somewhat pointless exercise to optimize the merge code, as it is already much faster than the chip mem write, so the CPU spends most of its time waiting for chip memory. To get more speed out, you need to interleave (some of) the effect and C2P code so you use those CPU cycles for something useful.
C2P from fast to fast can make sense as a way to ease this interleaving (as you just need to interleave your effect code with a simple copy), but copying the result to chip mem in a loop that does nothing else is a reckless waste of good CPU cycles. :)
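A very crude way to picture this is a cost model in Python. It is only a sketch: the function names are made up, the timings are hypothetical placeholders rather than measurements, and the "interleaved" case assumes the CPU work hides perfectly inside the chip write waits, which real code will only approximate.

```python
def frame_time_serial(effect_us, c2p_cpu_us, chip_write_us):
    """Effect, merge code and chip writes run one after the other."""
    return effect_us + c2p_cpu_us + chip_write_us


def frame_time_interleaved(effect_us, c2p_cpu_us, chip_write_us):
    """Idealised: CPU work runs while the chip bus is busy, so the longer one wins."""
    return max(effect_us + c2p_cpu_us, chip_write_us)


# Hypothetical numbers, in microseconds, chosen only to show the shape of the result.
effect, merges, chip = 6000, 3000, 11000
print(frame_time_serial(effect, merges, chip))       # 20000
print(frame_time_interleaved(effect, merges, chip))  # 11000
```

In this model, shaving cycles off the merge code changes nothing in the interleaved case as long as the chip write stays the longer term.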
|
johnsa
Member |
Point taken :)
I'm getting 7MB/s on my 1200 030/50.. maybe I'm just lucky! ;)
Well, we know we want to do the transfer from fast->chip when there is no DMA, which I assume would require it to be done during vblank (no raster DMA running)?
Another thought I had was beam-chasing: say, wait for raster 150, then start the copy knowing the vblank will occur before the copy reaches the beam, so you can get the entire copy done in-frame.
While the copy is running, you could interleave slices of the next frame's generation, I guess.
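The beam-chasing idea can be checked with a small Python model. Everything here is an assumption for illustration: a PAL frame of 312 lines, a 256-row display starting at raster 44, a single set of bitplanes, and a copy that proceeds at a uniform rate over the 220 rasterlines quoted earlier in the thread for 060/50 (in practice the rate changes when display DMA stops). The function name is made up.

```python
def copy_is_tear_free(start_line, copy_lines=220, rows=256,
                      display_top=44, lines_per_frame=312):
    """True if every row is written after the beam showed it and before it shows it again."""
    lines_per_row = copy_lines / rows
    for row in range(rows):
        write_begin = start_line + row * lines_per_row
        write_end = write_begin + lines_per_row
        shown_this_frame_end = display_top + row + 1
        shown_next_frame_begin = display_top + row + lines_per_frame
        if write_begin < shown_this_frame_end:
            return False  # copy caught up with the beam
        if write_end > shown_next_frame_begin:
            return False  # beam wrapped around and caught the copy
    return True


print(copy_is_tear_free(150))  # True under these assumptions
print(copy_is_tear_free(80))   # False: the copy overtakes the beam near the bottom
```

Because the copy is faster than the beam in this model, the last row is the tightest case, so the start line only has to be late enough that the copy reaches the bottom just after the beam does.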
|
johnsa
Member |
So I've played around a bit more with this now. I guess you really need to adopt a different C2P approach depending on the type of effect.
For example, I thought I'd create a classic tunnel. For that (like any other effect which fills every pixel in order) I guess it makes sense to do something like:
Option 1: Render chunky row (to fast mem), then C2P the row directly to chip. Not sure if this is a good idea, as the only interleaving is really the C2P merges and the chip write itself.
Option 2: Render chunky row (to fast mem), C2P chunky row (to fast mem), transfer row to chip.
Once again, that seems to have the issue that the transfer of the row to chip will spend most of its time doing nothing.
Option 3: Render chunky row (to fast mem), C2P chunky row (to fast mem), transfer a small block to chip (say a single MOVE16?). Loop this until the effect is fully rendered, then finish off the remaining transfer to chip.
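The third option can be sketched as a small Python simulation to see how much data is left for the final pass. This is only a model: `planar_row` is a hypothetical placeholder for "render a row and C2P it to fast mem", byte arrays stand in for fast and chip memory, and the only hardware fact used is that MOVE16 moves a 16-byte block.

```python
MOVE16_BYTES = 16  # MOVE16 transfers one 16-byte line


def planar_row(row, row_bytes):
    """Placeholder for render + C2P of one row (hypothetical test pattern)."""
    return bytes((row + i) & 0xFF for i in range(row_bytes))


def run_option3(rows=256, row_bytes=320, blocks_per_row=1):
    fast = bytearray()  # planar data waiting in fast mem
    chip = bytearray()  # stand-in for the chip mem bitplanes
    sent = 0
    for row in range(rows):
        fast += planar_row(row, row_bytes)
        # Trickle a few small blocks to chip between rows.
        for _ in range(blocks_per_row):
            block = fast[sent:sent + MOVE16_BYTES]
            chip += block
            sent += len(block)
    leftover = len(fast) - sent
    chip += fast[sent:]  # finish off the remaining transfer
    return chip == fast, leftover


print(run_option3(blocks_per_row=1))   # (True, 77824)
print(run_option3(blocks_per_row=20))  # (True, 0)
```

With one 16-byte block per row, about 95% of the frame is still waiting at the end, so the number of blocks per iteration has to be balanced against how long the render step takes.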
Of course there are other things to deal with: ideally you want the transfer to chip to happen during the vblank. So do you let the small bits in the main loop just go whenever, and then wait for vblank to do the finishing-off bit?
With any of these progressive C2Ps, if they're transferring during the frame, you clearly don't want to be transferring to the active frame, so this poses another two options:
1: Only start the main render loop at, say, raster 2. Assuming the effect is not faster than a single raster per row (and besides, you're only doing a single piece per row), you can safely assume each copy is always behind the beam.
2: Use two sets of bitplanes, and on a vblank interrupt swap the bitplane pointers only once the copy to chip (post-C2P) is complete.
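The two-sets-of-bitplanes option boils down to a small piece of state, sketched here in Python. The class and method names are made up for illustration. On real hardware `on_vblank` would be the vblank interrupt (or a copper list update) writing the bitplane pointers; here it just flips an index.

```python
class DoubleBuffer:
    def __init__(self):
        self.front = 0           # index of the bitplane set being displayed
        self.back_ready = False  # set once effect + C2P + copy are complete

    @property
    def back(self):
        """Index of the bitplane set it is safe to write into."""
        return self.front ^ 1

    def finish_frame(self):
        """Called by the main loop when the back buffer is fully written."""
        self.back_ready = True

    def on_vblank(self):
        """Swap only if a complete frame is waiting; returns True if a flip happened."""
        if not self.back_ready:
            return False
        self.front ^= 1
        self.back_ready = False
        return True


db = DoubleBuffer()
print(db.on_vblank())  # False: nothing finished yet, keep showing buffer 0
db.finish_frame()
print(db.on_vblank())  # True: buffer 1 is now displayed
```

One thing the sketch leaves out: after `finish_frame` the main loop has to wait for the flip before writing again, otherwise it would start overwriting the frame that is about to be shown.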
So many questions.. but this stuff is really interesting :)
|
todi
Member |
Thinking out loud here... How about implementing some kind of chunky2compressed-bitmap, which would lower the amount of data transferred to chip, and then using the blitter to unpack the compressed data?
|
johnsa
Member |
That is an interesting idea, I like it :)
Another option would be to keep a 256-bit list of which rows have been modified (mileage varies: you'd have to have an effect that works row-wise, and you'd need a situation where you know only some rows are actually changing per frame) and then only C2P/copy those modified rows. I guess you could also do NxN blocks on screen, and have the blitter put those into place.
I'm also interested in the options around how best to avoid tearing with C2P. If your effect takes less than the display period and the C2P takes less than the vblank, that is fine. But assuming that is not the case, do you C2P into a set of off-screen bitplanes and then update the copper list to page-flip them once both the effect and C2P are complete?
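The modified-rows list could look something like this Python sketch, which uses a single integer as the 256-bit mask. The class and method names are made up for illustration; on the 68k side this would more likely be eight longwords tested with bit instructions.

```python
class DirtyRows:
    def __init__(self, rows=256):
        self.rows = rows
        self.mask = 0  # bit r set means row r changed since the last drain

    def mark(self, row):
        if not 0 <= row < self.rows:
            raise IndexError(row)
        self.mask |= 1 << row

    def drain(self):
        """Return the modified rows in order and clear the mask."""
        dirty = [r for r in range(self.rows) if (self.mask >> r) & 1]
        self.mask = 0
        return dirty


tracker = DirtyRows()
for row in (3, 200, 3):
    tracker.mark(row)
print(tracker.drain())  # [3, 200]: only these rows need C2P + copy this frame
print(tracker.drain())  # []
```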
|