|
Author |
Message |
xeron
Member |
A discussion on pouet got me thinking...
Apparently, the Atari ST's interleaved bitplanes make it possible to use movep to accelerate C2P somehow (I've never used movep, so I don't know the specifics). Anyway, if interleaved bitplanes are useful for fast C2P, surely the Amiga can have a similar display using modulos? I.e. set each bitplane pointer 40 bytes past the previous one, and the modulo to 40*(planes-1), since the modulo is added on top of the 40 bytes already fetched for the line. Then you can do a movep C2P.
|
xeron
Member |
Ehm... it seems I misunderstood what was meant by "interleaved". So that won't work.
|
sp_
Member |
The movep instruction is not usable in a c2p on the MC68020+, since memory writes to chipmem won't pipeline with CPU instructions. On the Amiga the c2p conversion runs at copyspeed on a 50 MHz 030 and up (scrambled, or blitter) and at copyspeed on the 040+ (5-pass merge). Copyspeed means as fast as moving a longword from fastram to chipram.
-
There are ways to go beyond copyspeed. Display DMA sometimes makes chipmem slower. If you do the c2p conversion while chipmem is not busy with displaying the screen, and render something in fastmem while chipmem is busy, your c2p can go up to 3 times faster! This can be achieved by using the copper to trigger a multitasking interrupt: wait for a rasterline, trigger the c2p conversion, wait some more, trigger the effect renderer, wait again, trigger the c2p, and so on.
Read more about it in my tutorial:
The 3 * faster than copyspeed trick:
http://membres.lycos.fr/amycoders/sources/fasttrue c2p.html
|
winden
Member |
Oh man, that tutorial was really great at the time ^^ You're coming to Breakpoint next year, aren't you, sp?
|
Kalms
Member |
MOVEP.L D0,0(A0) writes the bytes in D0 to locations A0+0, A0+2, A0+4, A0+6. Thus it will perform four individual byte write-operations. One MOVEP on 68000 is somewhat slower than an ordinary MOVE.L but it replaces roughly 20% of the C2P transformation instructions. Therefore it is advantageous to use it on the Atari ST.
[Atari ST again] Also, if you're rendering 4bpl and doing C2P at the same time (i.e. you're collecting the pixels together in a register and flush them out 16 at a time), then it so happens that if you write them out using MOVEP.L instead of MOVE.L, you can cache 8 pixels in one longword and have the bits end up in the correct places (whereas if you used MOVE.L you would need to cache 16 pixels, spread over two registers). So using the MOVEP to write out pixel data is more convenient.
MOVEP might be useful in a similar vein on the A500, but I haven't really looked into it.
On 68020-68040 the MOVEP instruction is not as useful because it does so many individual memory accesses, and that becomes a performance bottleneck (absolutely when accessing chipmem, less so when accessing fastmem). On 68060 the instruction is software emulated, so it is not useful at all on that processor.
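To make the byte placement concrete, here is a minimal sketch in Python (not 68k code, just a model that is easy to run). It compares where MOVEP.L and MOVE.L put the four bytes of a register. The function names and the 8-byte `screen` buffer are made up for the illustration; the buffer stands for four interleaved 16-bit plane words as on the ST.

```python
def movep_l_write(memory, address, value):
    """Model MOVEP.L Dn,0(An): store the four bytes of a 32-bit value,
    most significant first, at address+0, +2, +4 and +6."""
    for i in range(4):
        shift = 24 - 8 * i
        memory[address + 2 * i] = (value >> shift) & 0xFF


def move_l_write(memory, address, value):
    """Model MOVE.L Dn,(An): store the four bytes at consecutive addresses."""
    for i in range(4):
        shift = 24 - 8 * i
        memory[address + i] = (value >> shift) & 0xFF


# Hypothetical slice of an interleaved screen: plane 0, 1, 2, 3 words.
screen = bytearray(8)
movep_l_write(screen, 0, 0x11223344)
print(screen.hex(" "))
```

With MOVEP.L each register byte lands in the high byte of a different plane word, which is why 8 pixels (one byte per plane) can be flushed from a single longword.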
|
sp_
Member |
re, winden
We will see. I plan to move to Thailand for a while, and maybe I'll code some MC680x0 again in my bungalow. Actually I've never been to Breakpoint before, so I really should come.
|
sp_
Member |
How to do a fast unscrambled 4-bitplane c2p on the Amiga 500 (1x2) (320*256)
7 MHz MC68000
2 suggestions:
1 CPU merge and 3 blitter merges.
1 CPU merge, 2 blitter merges, and the hardware scroll register with a shade mask (2x2).
(Just a quick suggestion, not tested)
A c2p needs 5 merges.
Assuming a0 points to an 8-bit chunky buffer with 4-bit colors and a1 to a chipmem buffer (swap 4):
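.swap4c2p
movem.l (a0)+,d0-d1 ;first two longwords of chunky pixels
lsl.l #4,d0 ;move them to the high nibbles
lsl.l #4,d1
or.l (a0)+,d0 ;merge the next two longwords into the low nibbles
or.l (a0)+,d1
move.l d0,(a1)+ ;movem cannot store with (a1)+ on the 68000
move.l d1,(a1)+
cmpa.l a2,a1 ;a2 = end of the chipmem buffer
bne.b .swap4c2p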
After this we need 3 ABD blitter passes. The swap 16 pass is not needed since the blitter works on 16-bit words.
I think the blitter with ABD channels on the Amiga 500 can copy around 40k/frame. (???)
Each merge pass will copy 160*128/2 = 10240 bytes (the 2x2 case; a 320*128 buffer would be 20480 bytes per pass),
for a total of 10240 * 3 = 30720 bytes, which is under 1 frame...
The vertical stretch (1x2) is done with the copper and modulo.
For a 2x2 mode it's possible to remove one blitter pass and use the scroll
register instead (scroll the even bitplanes one pixel to the right) to emulate the swap 1 pass.
The 5th bitplane should contain a black dot mask to remove every second garbage pixel.
This technique was invented by Ludde back in 1995 and used in many 1995 demos, like the TP95
winning demo Closer.
..
For table effects, rotzoomers etc. it might be a good idea to put the effect code inside the c2p to avoid the chunky buffer. I think it's possible to do 25 fps fullscreen newschool effects on the 7 MHz MC68000 A500 (in 160*128 res).
|
sp_
Member |
The HTML seems to distort the text, but I will make a complete version when I complete the code.
Edit:
I had to include a swap 16 pass. The swap 16 pass might be removed or replaced by moving words from the chunky buffer, which might be faster on the MC68000. Two source examples are included at the bottom.
..
Mc68000 Amiga 500 c2p
Amiga 500 (7mhz Mc68000)
Algorithm by Rune Stensland (c) 2006 (sp/ contraz)
sp@kvarteret.uib.no
1.5 cpu pass, 2 blitter pass 2x2 c2p (scrollregister trick)
Input:
00000000a3a2a1a0 00000000b3b2b1b0 00000000c3c2c1c0 00000000d3d2d1d0
00000000e3e2e1e0 00000000f3f2f1f0 00000000g3g2g1g0 00000000h3h2h1h0
00000000i3i2i1i0 00000000j3j2j1j0 00000000k3k2k1k0 00000000l3l2l1l0
00000000m3m2m1m0 00000000n3n2n1n0 00000000o3o2o1o0 00000000p3p2p1p0
Input (each number indicates color number as a byte in a longword):
1. 0 1 2 3
2. 4 5 6 7
3. 8 9 a b
4. c d e f
(swap 16) (1|2) (3|4)
1. 0 1 4 5
2. 2 3 6 7
3. 8 9 c d
4. a b e f
Nibble OR pass
Output (longword, as nibbles):
1. 0 2 1 3 4 6 5 7
2. 8 a 9 b c e d f
Input to blitter (words, as nibbles):
1. 0 2 1 3
2. 4 6 5 7
3. 8 a 9 b
4. c e d f
Blitter swap 8 (1x2) (3x4)
Output:
1. 0 2 4 6
2. 1 3 5 7
3. 8 a c e
4. 9 b d f
Input:
1. a3a2a1a0 c3c2c1c0 e3e2e1e0 g3g2g1g0
2. b3b2b1b0 d3d2d1d0 f3f2f1f0 h3h2h1h0
3. i3i2i1i0 k3k2k1k0 m3m2m1m0 o3o2o1o0
4. j3j2j1j0 l3l2l1l0 n3n2n1n0 p3p2p1p0
Blitter swap 2 (1x2) (3x4) (16 bit word merges)
Output:
1. a3a2b3b2 c3c2d3d2 e3e2f3f2 g3g2h3h2
2. a1a0b1b0 c1c0d1d0 e1e0f1f0 g1g0h1h0
3. i3i2j3j2 k3k2l3l2 m3m2n3n2 o3o2p3p2
4. i1i0j1j0 k1k0l1l0 m1m0n1n0 o1o0p1p0
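The blitter merges can be checked with a small symbolic model. This is only a sketch in Python, under the assumption that a "swap n" merge collects the upper n-bit group of every 2n-bit field from both input words into one output word and the lower groups into the other. The helper names `word` and `merge` are invented here; the bit labels come from the diagrams.

```python
def word(labels):
    """Split a string such as 'a3a2a1a0 c3c2c1c0 ...' into 16 bit labels."""
    compact = labels.replace(" ", "")
    return [compact[i:i + 2] for i in range(0, len(compact), 2)]


def merge(first, second, n):
    """One 'swap n' merge of two 16-bit words given as lists of bit labels.
    Returns (hi, lo): hi takes the upper n-bit group of each 2n-bit field
    from both inputs, lo takes the lower group."""
    hi, lo = [], []
    for start in range(0, 16, 2 * n):
        hi += first[start:start + n] + second[start:start + n]
        lo += first[start + n:start + 2 * n] + second[start + n:start + 2 * n]
    return hi, lo


row1 = word("a3a2a1a0 c3c2c1c0 e3e2e1e0 g3g2g1g0")
row2 = word("b3b2b1b0 d3d2d1d0 f3f2f1f0 h3h2h1h0")
hi, lo = merge(row1, row2, 2)
print("".join(hi))
print("".join(lo))
```

On real hardware the same result comes from mask-and-shift blits, roughly (A & mask) | ((B >> n) & ~mask) for one output and ((A << n) & mask) | (B & ~mask) for the other.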
Bitplane 1 and 2 should point to a buffer containing the words:
1 , 2 , 5, 6 ...
Bitplane 3 and 4 should point to a buffer containing the words:
3 , 4 , 7 ,8 ...
Then the odd bitplanes should be scrolled one pixel to the right with the scroll register.
This will emulate the last merge pass needed, swap 1 (1x2) (3x4).
One pixel becomes 100% accurate, the next pixel is garbage.
Now set up the 5th bitplane with a shade mask like this (every other bit set):
x x x x x x x x x x
x x x x x x x x x
x x x x x x x x x x
The vertical (1x2) resolution can be made by using the copper and modulo.
How to adjust the color map
Colors 0-15 normal palette
Colors 16-31 either black or...
Now comes the new idea:
Lets see what the garbage pixel is made of..
a2 bit 2 of the first pixel
b3 bit 3 of the second pixel
a0 bit 0 of the first pixel
b1 bit 1 of the second pixel.
With a sorted color map (sorted by e.g. sqrt(r^2 + g^2 + b^2)) this information can be used
to make a pretty good prediction of the color of the pixel in between two actual pixels. Then you will get a hardware-smoothed screen all for free
(no cycles required, only adjusting colors 16-31).
The "garbage" pixel contains 2 bits from each of the surrounding pixels. I guess the predicted smooth pixel will have 3-bit accuracy.
---
Code for the CPU pass, unoptimized (not pipelined for 020+):
.swap4c2p
movem.l (a0)+,d0-d3 ;fetch 16 chunky pixels
move.l d1,d4 ;swap16 (1X2) (3X4)
move.l d3,d5 ;second pair mirrors the first (d2/d3 for d0/d1)
move.w d0,d4
move.w d2,d5
swap d4
swap d5
move.w d4,d0 ;1
move.w d5,d2 ;3
move.w d1,d4 ;2
move.w d3,d5 ;4
lsl.l #4,d0 ;Or nibbles (swap4)
lsl.l #4,d2
or.l d0,d4
or.l d2,d5
move.l d4,(a1)+
move.l d5,(a1)+
dbf d7,.swap4c2p
Or this loop; I don't know which is faster on the MC68000:
.loop
REPT 4
move.w (a0)+,d0
lsl.w #4,d0
or.w (a0)+,d0
move.w d0,(a1)+
ENDR
dbf d7,.loop
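A quick way to check both CPU loops against the nibble diagrams is to model them on plain integers. This is a minimal sketch in Python rather than 68k code; the function names are made up, and it assumes the input is 16 pixels stored one per byte with the high nibble zero, as in the diagrams.

```python
MASK32 = 0xFFFFFFFF


def cpu_pass_longs(d0, d1, d2, d3):
    """Swap 16 followed by the nibble OR, on four longwords of pixels.
    Returns the two packed output longwords."""
    # swap 16: exchange the low word of d0 with the high word of d1,
    # and the low word of d2 with the high word of d3.
    a = (d0 & 0xFFFF0000) | (d1 >> 16)
    b = ((d0 & 0xFFFF) << 16) | (d1 & 0xFFFF)
    c = (d2 & 0xFFFF0000) | (d3 >> 16)
    d = ((d2 & 0xFFFF) << 16) | (d3 & 0xFFFF)
    # nibble OR: shift the first of each pair left by 4, OR in the second.
    return ((a << 4) & MASK32) | b, ((c << 4) & MASK32) | d


def cpu_pass_words(words):
    """Word-sized variant: pack each pair of consecutive 16-bit words."""
    return [((words[i] << 4) & 0xFFFF) | words[i + 1]
            for i in range(0, len(words), 2)]


longs = cpu_pass_longs(0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F)
words = cpu_pass_words([0x0001, 0x0203, 0x0405, 0x0607,
                        0x0809, 0x0A0B, 0x0C0D, 0x0E0F])
print([f"{x:08x}" for x in longs], [f"{x:04x}" for x in words])
```

Both variants give the same byte stream, so the word loop gets the swap 16 for free from the order in which it reads the chunky buffer.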
|
Kalms
Member |
Look, on the a500 you're usually better off mangling the texture a bit beforehand. Then you end up with something like:
; gather 8 pixels (say, for an unrolled tunneltable or a rotozoomer with
; SMCed pixel offsets)
(a0) => a3a2----a1a0----
(a1) => ----a3a2----a1a0
move.b 0x1234(a0),d6
or.b 0x1234(a1),d6
move.b d6,(a2)+
move.b 0x1234(a0),d7
or.b 0x1234(a1),d7
move.b 0x1234(a0),d6
or.b 0x1234(a1),d6
move.b d6,(a2)+
move.b d7,(a2)+
move.b 0x1234(a0),d6
or.b 0x1234(a1),d6
move.b d6,(a2)+
output format:
a3a2b3b2a1a0b1b0 e3e2f3f2e1e0f1f0 c3c2d3d2c1c0d1d0 g3g2h3h2g1g0h1h0
... then you need one 4bit merge to get to a1a0b1b0c1c0d1d0... format, and then you're free to either (a) rely on having the palette sorted, as you outlined above, or (b) do blitterscreen masking or (c) do half an 1bit merge to get 100% correct results.
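For reference, the pre-mangled texture idea can be modelled like this. It is a sketch in Python, assuming each texel is a 4-bit colour and that the two texture copies hold the bit layouts shown for (a0) and (a1); the names `mangle_even`, `mangle_odd` and `gather` are invented for the example.

```python
def mangle_even(pixel):
    """Texture copy read via (a0): byte layout a3a2----a1a0---- ('-' = 0)."""
    return ((pixel & 0xC) << 4) | ((pixel & 0x3) << 2)


def mangle_odd(pixel):
    """Texture copy read via (a1): byte layout ----a3a2----a1a0."""
    return ((pixel & 0xC) << 2) | (pixel & 0x3)


def gather(a, b):
    """What move.b x(a0),d6 / or.b y(a1),d6 produces: a3a2b3b2a1a0b1b0."""
    return mangle_even(a) | mangle_odd(b)


# a = %1010, b = %0111 -> a3a2b3b2 a1a0b1b0 = 1001 1011
print(f"{gather(0b1010, 0b0111):08b}")
```

Because the two copies never set the same bit, a plain OR is enough and no shifting is needed in the inner loop.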
|
Kalms
Member |
With regards to the "blurring" method you have outlined above, let us for example list all pixel-colour pairs where the "take bits 2&0 from the first pixel, and bits 3&1 from the second pixel" hashing operation produces the same hash:
5,8 => %11 %10
7,8 => %11 %10
13,8 => %11 %10
15,8 => %11 %10
5,9 => %11 %10
7,9 => %11 %10
13,9 => %11 %10
15,9 => %11 %10
5,12 => %11 %10
7,12 => %11 %10
13,12 => %11 %10
15,12 => %11 %10
5,13 => %11 %10
7,13 => %11 %10
13,13 => %11 %10
15,13 => %11 %10
Now one needs to find a palette colour which is a good approximation to all these pixel pairs at once.
Some of the colour pairs listed above are too far from each other (especially 5,8 versus 15,13) to give an accuracy of 3 bits in all cases; it looks more like the precision will be, um, 1-2 bits or so.
(A reasonable choice would be to map all those pairs to colour 11. That gives an absolute max error of 4 units, or about 1.5 bits of accuracy.)
Still, who knows - for some scenarios it might look better than blitterscreen.
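The collision set is easy to enumerate by brute force. A minimal sketch in Python, assuming the hash is exactly "bits 2 and 0 of the first pixel, bits 3 and 1 of the second"; `pair_hash` is a name made up for the example.

```python
def pair_hash(first, second):
    """Garbage-pixel hash: (bits 2,0 of first pixel, bits 3,1 of second)."""
    h1 = ((first >> 1) & 0b10) | (first & 0b01)
    h2 = ((second >> 2) & 0b10) | ((second >> 1) & 0b01)
    return h1, h2


colliding = [(a, b) for a in range(16) for b in range(16)
             if pair_hash(a, b) == (0b11, 0b10)]
firsts = sorted({a for a, _ in colliding})
seconds = sorted({b for _, b in colliding})
print(len(colliding), firsts, seconds)
```

Every one of the 16 possible hashes is shared by 16 pixel pairs in the same way, so each palette entry in 16-31 has to stand in for 16 different neighbourhoods.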
|
sp_
Member |
I need to do some testing before I know how good the results will be.
Edit: Need time to think about this one..
|
sp_
Member |
Scrambling the chunky buffer is probably the fastest, yes. With SMC you just scramble the offsets as you mentioned. I will do some testing in WinUAE. The blitter pass should perhaps be run in "blitter nasty" mode, as chipmem will be very slow with c2p and DMA for a 5bpl screen/copper in the background.
I am not good at optimizing for the MC68000 without fastmem, but I will do some testing with the "match A500 speed" setting in WinUAE.
|
sp_
Member |
The SMC loop can be done faster (saving 2 moves), like this:
move.w 0x1234(a0),d5
or.w 0x1234(a1),d5
move.b 0x1234(a0),d5
or.b 0x1234(a1),d5
move.w 0x1234(a0),d6
or.w 0x1234(a1),d6
move.b 0x1234(a0),d6
or.b 0x1234(a1),d6
move.w d5,(a2)+
move.w d6,(a2)+
|
Kalms
Member |
Regarding the SMC loop: The 68000 will raise an address error if you try to fetch words from odd addresses. If you duplicate each byte twice (so you have 1 word per pixel) it'll work, albeit with half the texture resolution. Granted, it is probably a good tradeoff for the A500.
Regarding the blurring/hashing: one major problem is that 5,8 and 15,13 map to the same hash. There is no way to differentiate between those two colour combinations, and no single choice of hash-color will give you good results for both those pixel pairs.
As far as I can see there is no room to do non-linear weighting of the pixel pair before the hash computation (as the hash is just "take bits BD from pixel 1, and bits AC from pixel 2"). The root of the problem is that when skipping some higher-order bits (bit #3 in pixel 1, bit #2 in pixel 2) the information in the lower-order bits is nearly useless (it becomes more akin to white noise). What you need to get high quality results is bits #3 and #2 from each of the two pixels; anything else and the max error will be roughly 4 units.
What I mean with the 'white noise' behaviour is kind of the following... here is what all (N, 15) transitions hash to:
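0, 15 => %00 %11
2, 15 => %00 %11
8, 15 => %00 %11
10, 15 => %00 %11
1, 15 => %01 %11
3, 15 => %01 %11
9, 15 => %01 %11
11, 15 => %01 %11
4, 15 => %10 %11
6, 15 => %10 %11
12, 15 => %10 %11
14, 15 => %10 %11
5, 15 => %11 %11
7, 15 => %11 %11
13, 15 => %11 %11
15, 15 => %11 %11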
Notice that the different colours get interleaved during the hashing. This means that there's going to be a bit of jumpiness during any gradients.
Or, have a look at this example -- a linear ramp up, delta 1:
0, 1 => %00 %00
1, 2 => %01 %01
2, 3 => %00 %01
3, 4 => %01 %00
4, 5 => %10 %00
5, 6 => %11 %01
6, 7 => %10 %01
7, 8 => %11 %10
8, 9 => %00 %10
9, 10 => %01 %11
10, 11 => %00 %11
11, 12 => %01 %10
12, 13 => %10 %10
13, 14 => %11 %11
14, 15 => %10 %11
... and a linear ramp down, delta 1:
1, 0 => %01 %00
2, 1 => %00 %00
3, 2 => %01 %01
4, 3 => %10 %01
5, 4 => %11 %00
6, 5 => %10 %00
7, 6 => %11 %01
8, 7 => %00 %01
9, 8 => %01 %10
10, 9 => %00 %10
11, 10 => %01 %11
12, 11 => %10 %11
13, 12 => %11 %10
14, 13 => %10 %10
15, 14 => %11 %11
Which pixel pairs give the same hashes?
0,1 <=> 2,1
1,2 <=> 3,2
2,3 <=> 8,7
3,4 <=> 1,0
4,5 <=> 6,5
5,6 <=> 7,6
6,7 <=> 4,3
7,8 <=> 13,12
8,9 <=> 10,9
9,10 <=> 11,10
10,11 <=> (no match)
11,12 <=> 9,8
12,13 <=> 14,13
13,14 <=> 15,14
14,15 <=> 12,11
(no match) <=> 5,4
For *most* of the cases, it looks fine. Except for 5 of them; for instance, 11,12 and 9,8 map to the same hash. Why is this? Because some high-order bits get ignored. So either the smooth black->white or the white->black gradient is going to get noisy, no matter how the pixel pair is weighted when calculating the new palette.
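The ramp tables can be regenerated mechanically, which is handy when trying other bit selections. A minimal sketch in Python under the same hash assumption (bits 2,0 of the first pixel, bits 3,1 of the second); all names are invented for the example.

```python
def pair_hash(first, second):
    """Garbage-pixel hash: (bits 2,0 of first pixel, bits 3,1 of second)."""
    h1 = ((first >> 1) & 0b10) | (first & 0b01)
    h2 = ((second >> 2) & 0b10) | ((second >> 1) & 0b01)
    return h1, h2


# Hash -> pixel pair, for the ramp going up and the ramp going down.
up = {pair_hash(n, n + 1): (n, n + 1) for n in range(15)}
down = {pair_hash(n + 1, n): (n + 1, n) for n in range(15)}

# Pairs from the two ramps that end up on the same palette entry.
shared = sorted((up[h], down[h]) for h in up.keys() & down.keys())
for rising, falling in shared:
    print(rising, "<=>", falling)
```

Swapping in a different `pair_hash` (for example one that keeps bits 3 and 2 of the first pixel) shows immediately how the matches change.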
I know I'm being a bit negative here, sorry about that, but I just don't see the usefulness in the 'blurring' approach as it is currently defined.
|
sp_
Member |
You are right, there is no room to do non-linear weighting here.
By scrambling the texture (swapping bit 0 with bit 1),
it is possible to join bits 2 and 3 from the left pixel and bits 0 and 1 from the next pixel. Then the "calculated pixel" will look more like the left pixel, since its most significant bits are being used.
This removes some of the "noise".
Maybe "blur" is not the right term.
It's more like how to make a blitterscreen look better without any extra cycles :D
As for MC68000 optimizing, I still have something to learn. I always optimized for the MC68020+ when I was active.
Anyway, I will code the c2p, measure the speed, and experiment with the colormaps and blitter passes.
Combine this with SMC and I hope to make a fullscreen texture-mapped torus at 25 fps on the vanilla Amiga 500 from 1986 :D
|
Kalms
Member |
Now that makes sense. One of the main problems with blitterscreen is that the image gets much darker; if the odd pixels can be just (color & 0xc) rather than pitch-black then it's going to look noticeably better. Good idea!
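A rough way to see why this variant behaves better: if the in-between pixel keeps the two high bits of the left pixel, it can never be more than 3 units away from it. A minimal sketch in Python, assuming the in-between value is (left & 0xC) | (right & 0x3) and that colour distance is just the difference between indices in a sorted palette; `in_between` is a made-up name.

```python
def in_between(left, right):
    """Bits 3 and 2 from the left pixel, bits 1 and 0 from the right one."""
    return (left & 0xC) | (right & 0x3)


worst = max(abs(in_between(l, r) - l)
            for l in range(16) for r in range(16))
print("worst-case distance from the left pixel:", worst)
```

The same bound holds for the plain (color & 0xc) palette choice, since both only touch the two low bits.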
|
Azure
Member |
Well, for the cost of half a merge you could simply swap even and odd bits after the 2 bit merge. Then you end up with something like this:
old a3a2b3b2 c3c2d3d2 e3e2f3f2 g3g2h3h2
swapped a2a3b2b3 c2c3d2d3 e2e3f2f3 g2g3h2h3
old a1a0b1b0 c1c0d1d0 e1e0f1f0 g1g0h1h0
swapped a0a1b0b1 c0c1d0d1 e0e1f0f1 g0g1h0h1
mask 0 1 0 1 ...
On A1200 this is faster than doing a full merge, since adding an additional bitplane does not eat that many bus cycles. Ironically, due to the lousy design of AGA, this trick can still be used in hires mode to get to 1x1 without a huge speed penalty. Well, this is basically how my old C2P worked.
I am sure I considered several ways to skip the bit swapping part, but it appears I did not find a solution. I don't remember what I tried back then.
Adding a sprite layer to 8 bitplanes will basically eat up the entire bus bandwidth, which is why blitterscreen is not that favorable. It is not only that it looks ugly.
Not sure about the A500 though. Adding a fifth bitplane may be worse than doing an additional blitter pass, especially if you don't have fast mem.
On the A500 I would try to use a scrambled 4-bit chunky buffer, ideally already with merged pixels, so you only have to do 1.5 or 2 blitter passes. I would assume that moving the chunky buffer around with the CPU should be avoided at all costs.
|
sp_
Member |
Hey, Azure!
Nice idea indeed. Instead of scrolling the odd bitplanes you simply put a 010101 pattern in the upper
bitplane and adjust the colors to perform a 1-bit merge pass. This will produce 2 equal pixels, thus a
"perfect" 2x1 mode. I was thinking of a "blurred" mode where 2 pixels formed one pixel.
Still need to experiment with this one.
As for the DMA used to produce a 5bpl screen, I will make a small test program
and use a CIA timer to speed test the chipmem.
.
As my goal is to make the ultimate SMC texture polyfiller, the best would be to use a
scrambled 4-bit chunky buffer, which is scrambled both in the texture and in
the bytes. Kalms suggested a nice loop above which will do both the byte scrambling
and the texture scrambling, thus removing 2 c2p merge passes. With a chunky buffer like this one I
only need one blitter pass (swap 4, the nibble merge), and I can use your bitplane trick with 5bpl or another
half blitter merge to get a finished color.
Txturemapp innerloop:
Given that I interpolate between every 8th pixel, I would need 8 SMC "Kalms style" innerloops for the texture-mapping polyfiller,
each loop mapping 1, 2, 3, 4, 5, 6, 7 or 8 pixels.
|
winden
Member |
I already had some code for tmapping with SMC on the A500, using multiple copies of the SMC code-with-offsets depending on the alignment of the data relative to the modulo-n X start position of the span.
That is to say, when doing c2p with 4 pixels/word, we would need 4 copies of the mapper, each one filling from one of the 4 textures into a different part of the register.
the texture was cloned 4 times:
a3------a2------a1------a0------
--a3------a2------a1------a0----
----a3------a2------a1------a0--
------a3------a2------a1------a0
and screen format for a 1x1 4bpl screen was something like
a3b3c3d3 a2b2c2d2 a1b1c1d1 a0b0c0d0
you can see that this format is easy to rearrange with the blitter
into a proper screen.
The innerloop ended up looking something like this:
offset_0:
move.w (a6),d0
move.w 1111(a0),d0
or.w 2222(a1),d0
or.w 3333(a2),d0
or.w 4444(a3),d0
move.w d0,(a6)+
...
offset_1:
move.w (a6),d0
and.w #%1110111011101110,d0
or.w 1111(a3),d0
move.w d0,(a6)+
move.w 2222(a0),d0
or.w 3333(a1),d0
or.w 4444(a2),d0
or.w 5555(a3),d0
move.w d0,(a6)+
...
offset_2:
move.w (a6),d0
and.w #%1100110011001100,d0
or.w 1111(a2),d0
or.w 2222(a3),d0
move.w d0,(a6)+
move.w 3333(a0),d0
or.w 4444(a1),d0
or.w 5555(a2),d0
or.w 6666(a3),d0
move.w d0,(a6)+
...
offset_3:
move.w (a6),d0
and.w #%1000100010001000,d0
or.w 1111(a1),d0
or.w 2222(a2),d0
or.w 3333(a3),d0
move.w d0,(a6)+
move.w 4444(a0),d0
or.w 5555(a1),d0
or.w 6666(a2),d0
or.w 7777(a3),d0
move.w d0,(a6)+
...
I recall trying various forms with 2bpl, using bytes instead of words, etc...
The trick Azure describes, using the extra bitplane to skip c2p passes, was the one I used in the trashcan 3 intro rotozoomer. The physical screen format was a1a0b1b0c1c0d1d0 (which needed 4 copies of the texture with different offsets), and at display time I just ran this same data into 2 bitplanes (with one of them hardware scrolled 1 pixel), with a %10101010 mask for the third one, then adjusted the palette so that the visible image was a 2x1 screen.
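The four cloned textures can be modelled as follows. This is only a sketch in Python, assuming clone number `slot` is the first layout shifted right by `slot` bits, as the four diagram rows suggest; `spread` and `combine` are invented names.

```python
def spread(pixel, slot):
    """16-bit texture word for a 4-bit pixel in clone number slot (0..3).
    Bit k of the pixel goes to bit 4*k + 3 - slot of the word."""
    word = 0
    for k in range(4):
        if pixel & (1 << k):
            word |= 1 << (4 * k + 3 - slot)
    return word


def combine(a, b, c, d):
    """OR four pixels together: a3b3c3d3 a2b2c2d2 a1b1c1d1 a0b0c0d0."""
    return spread(a, 0) | spread(b, 1) | spread(c, 2) | spread(d, 3)


# Each pixel has one distinct bit set, so the nibbles read 8, 4, 2, 1.
print(f"{combine(0b1000, 0b0100, 0b0010, 0b0001):04x}")
```

The and.w masks in the offset_1..3 entry points follow from this layout: for instance %1110111011101110 clears exactly the bits belonging to slot 3 before a new fourth pixel is ORed in.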
|
sp_
Member |
Nice Winden.
With Kalms' loop further up in this thread you need half the blitter passes needed here, since your format a3b3c3d3 a2b2c2d2 a1b1c1d1 a0b0c0d0
will require both a swap 4 and a swap 8 blitter pass.
..
Since I am still working full time as a coder, I haven't had time to work on my A500 mapper yet. But now I only have 4 days left of work!! And after that I move to Thailand to retire. ;)
Expect an Amiga release "Made in Thailand", maybe for Breakpoint? hehe
|
sp_
Member |
OK, I've finally implemented the first version of my A500 c2p now.
I need some help to time the speed on a real A500.
I have made a simple CIA timer which will calculate the number of rasterlines needed.
When run in ASM-One or AsmPro, after clicking the left mouse button,
write h.l blitterc2p . Post the number (in hex) here :D
I also need to test the speed of SMCTABLE: uncomment the call and timing methods in the HOVED: loop. Run again and write h.l smctable
...
c2p2x1_c0b1_scr_000.s
Scrambled chunkytoplanar (2x1 1 blitterpass)
Version 0.5
Implementation of the algorithms discussed on ada.underground.net
topic "Atari style c2p on amiga"
1 pass Blitter C2P suitable for amiga500 and MC68000
The chunky buffer is a scrambled nibble buffer in the following format
a3a2a1a0b3b2b1b0 c3c2c1c0d3d2d1d0 e3e2e1e0f3f2f1f0 g3g2g1g0h3h2h1h0
In the copperlist, planes 0,1 point to the same buffer and planes 2,3 to the same buffer.
The scroll register scrolls the odd planes one pixel to perform a merge.
In the 5th bitplane a mask of %1010101010101010 masks out the corrupted bits.
In the method SMCTABLE there is an example of how to use this chunky buffer.
http://www.esnips.com/doc/2ffc19a3-0b87-4448-84ac- e0eba356fb52/c2p2x1_c0b1_scr_000
|
sp_
Member |
I uploaded a new version now. This version renders a 160*256 4bpl chunky buffer
and performs a blitter c2p.
http://www.geocities.com/hallu_a/c2p2x1_c0b1_scr_0 00.asm
Edit: WinUAE timings were incorrect.
|
Toffeeman
Member |
Good to see you doing some A500 hardware stuff :0)
Didn't Chaos use a blitter C2P for his "Amiga Rules" 2*2 rotator in WOC? I think he improved it for Arte and Roots as there seems to be several 2*2 chunky pixel effects in those demos. How does your routine compare/differ to that one ?
|
jar
Member |
Dumb n00b questions:
Is it possible to extend this method to 640x256 medres in order to get a 320x256 chunkybuffer? Would it be unusably slow?
What is the fastest way to do c2p on a plain A1200?
|
ultra
Member |
moin,
Toffeeman:
----------
the woc rotzoomer works like this:
move.b $xxxx(ax),d0 ;11000000
or.b $xxxx(ax),d0 ;11220000
or.b $xxxx(ax),d0 ;11223300
or.b $xxxx(ax),d0 ;11223344
move.b d0,(ax)+
/.. unrolled loop ../
as you can see... one bitplane ... the colors/motionblur is done by rendering each frame into a different bitplane...
so the c2p is done via a planar table by oring the bits into one byte...
the arte rotzoomer innerloop is a bit longer... i won't post the code here... and it's a bit "harder" to understand because of the length...
it accesses a planar table two times per 4 pixels... by reusing the old planar data from the previous step
and using differently scrambled planar data it writes one word into each bitplane in the end... it's a bit comparable with the st c2p... but without using movep...
so the main difference is that both rotzoomers are using planar tables to do the c2p... no blitter at all...
the blitter c2p above doesn't need this... the c2p is done fully via the blitter and some tricks with bitplanes... so it's not comparable to these c2ps.
the sanity c2p is slower of course... but the above blitter c2p has the disadvantage that in a 2*x c2p resolution 1 pixel is sometimes garbage... but still... even a different blitter c2p with correct 2*x pixels would be faster than the sanity c2p... when you use 4 bitplanes... with only one like the woc rotzoomer a blitter wouldn't help... hm the woc rotzoomer could be optimised a bit... by creating .w planar data and writing back to screen with movem.w...hm anyway ;)
jar:
----
on a500, using med res to create a 1*1 c2p with the blitter c2p above is pretty unusable...
the display leeches 160 cycles from the available 226 cycles per scanline... in low res it's only 80...
so blitter and cpu slow down dramatically...
for 1*1 c2p i guess it's better to use a multipass blitter c2p like winden posted above in lo res...
on a1200 it may be different... read what azure wrote ;) or have a look at kalms' c2p routines, there is surely one
for a1200...
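To put those numbers in proportion, here is a tiny back-of-the-envelope calculation in Python. It is a simplification that takes the 226 cycles per scanline and the 80/160 display cycles quoted above at face value and ignores every other DMA user (copper, sprites, refresh); `free_fraction` is a made-up helper.

```python
CYCLES_PER_LINE = 226  # figure quoted in the post


def free_fraction(display_cycles):
    """Share of a scanline's cycles left over after bitplane fetches."""
    return (CYCLES_PER_LINE - display_cycles) / CYCLES_PER_LINE


for mode, cycles in (("low res, 4bpl", 80), ("med res, 4bpl", 160)):
    print(f"{mode}: {free_fraction(cycles):.1%} of each displayed line is free")
```

Under these assumptions the blitter and CPU together get less than half as much bus time per displayed line in med res as in low res.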
greetz ultra
|
z5_
Member |
off-topic:
@ultra: welcome :o) Hope you will do another Amiga production someday in the future (and not after another 18 years) :o)
|
ultra
Member |
hi z5...
unfortunately i discovered this forum very late... after sp posted on pouet...
@amiga stuff... i think i'll do... at the moment i'm more motivated to do a500 things ;)
|
StingRay
Member |
hey ultra, nice to see you here too. moin :P
for those of you who don't look on pouet, I'll post the message here too. :) Copy/pasted from my pouet message. =)
Ok, I can give results now after I added a nifty little "show number of used rasterlines on screen" routine so I just could run the exe on my A500. :)
5bpl version: 135/136 rasterlines
4bpl version: 128/129 rasterlines
blitter nasty was off. and i did test it on my chipmem only a500. ;)
updated source with the rasterline display can still be found at http://stingray.untergrund.net/c2p/c2p2x1_c0b1_scr _000_fixed.s
oh and I of course only timed the actual c2p, not the smctable stuff. =)
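For a feel of what those rasterline counts mean, a small Python calculation. It assumes a non-interlaced PAL frame of 313 lines (an assumption, not something measured in this thread); `frame_share` is an invented helper.

```python
PAL_LINES_PER_FRAME = 313  # assumption: non-interlaced PAL long frame


def frame_share(rasterlines):
    """Fraction of one frame spent in the c2p."""
    return rasterlines / PAL_LINES_PER_FRAME


for name, lines in (("5bpl", 136), ("4bpl", 129)):
    print(f"{name}: {frame_share(lines):.0%} of a frame")
```

So on these figures the c2p alone takes a bit under half a frame, leaving the rest for rendering.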
|
d0DgE
Member |
eheh, yeah... had a nice time testing the thing on my 500 :)
actually finding a not-wasted-by-ages disk was kind of a quest \o/
|
sp_
Member |
@kalms:
Regarding the SMC loop: The 68000 will raise an address error if you try to fetch words from odd addresses. If you duplicate each byte twice (so you have 1 word per pixel) it'll work, albeit with half the texture resolution. Granted, it is probably a good tradeoff for the A500.
.
By duplicating each byte twice I can have a max 256*128 texture, but then I can use the following loop. If I shrink the texture to 16x8 I can precalc 2 pixels in one move and double the speed (same loop, but with all the or instructions removed). This loop is probably optimal for SMC table effects when using this blitter c2p technique.
REPT WIDT*HEIGHT/32
move.w 0000(a0),d0
or.w 0000(a1),d0 ;ab00
move.w 0000(a0),d1
or.w 0000(a1),d1 ;cd00
move.b 0000(a0),d0
or.b 0000(a1),d0 ;abef
move.b 0000(a0),d1
or.b 0000(a1),d1 ;cdgh
move.w 0000(a0),d2
or.w 0000(a1),d2
move.w 0000(a0),d3
or.w 0000(a1),d3
move.b 0000(a0),d2
or.b 0000(a1),d2
move.b 0000(a0),d3
or.b 0000(a1),d3
move.w 0000(a0),d4
or.w 0000(a1),d4
move.w 0000(a0),d5
or.w 0000(a1),d5
move.b 0000(a0),d4
or.b 0000(a1),d4
move.b 0000(a0),d5
or.b 0000(a1),d5
move.w 0000(a0),d6
or.w 0000(a1),d6
move.w 0000(a0),d7
or.w 0000(a1),d7
move.b 0000(a0),d6
or.b 0000(a1),d6
move.b 0000(a0),d7
or.b 0000(a1),d7
movem.w d0-d7,-(a2)
ENDR
|