|
Author |
Message |
BalrogSoft
Member |
Hi, I made my first demoscene effect in assembler. I did a few small things in assembler some years ago, but now I'm learning a lot. I decided to make my first assembler effect this weekend, and I got it working! I made a chunky-to-copper routine and a rotozoomer effect. On my A600 it is very slow, but on my A1200 with a 060 it runs very smoothly. I'm not very good at assembler (I'm a newbie), so if somebody can suggest ways to optimize my code, here it is:
ASM Source code: URL
Executable: URL
Thanks in advance!
|
z5_
Member |
@BalrogSoft: Wow, this is cool. Congratulations! Isn't a rotozoomer quite a difficult effect to code? I ran it in WinUAE and it looked nice. Damn, I never got that far.
I really hope that experienced coders will step in and help you here. The first steps are always the most difficult. Keep it up!
|
Toffeeman
Member |
First, please remember I haven't coded anything in 68k assembler for 12 years! Can anyone get me a copy of Devpac 3? I did notice a few things from a quick look at the code, though.
1. The main reason for the low speed is the call to the GetUVTexture function, as it calculates sin+cos for every pixel you draw. This needs a multiply instruction, which on the 68000 takes something like 45 clock cycles (you will need to look up the exact instruction timings). To speed this up you really need to change your algorithm so that the only thing you do for each pixel is add an x step and a y step to your x and y texture co-ordinates. Basically, at the start of the frame you calculate the 4 corner points of the rotated rectangle, and from those you calculate the steps you need to walk through the rotated texture in x and y. This is properly explained here:
http://www.flipcode.com/articles/demomaking_issue10.shtml
2. I'd inline the calls to GetUVTexture and DrawPixel as well. I know it's good programming practice to have reusable code, but in this instance speed is the main thing and you'd save (4 instructions) * (pixel count) for each frame.
3. I'd keep all the most-used variables of the pixel loop in the CPU's internal registers as well.
4. When you have increased your frame rate and want a higher resolution, you can use the bitplanes to hold the colours. The copper can only change colour every 8 pixels, but if you start setting colours at the very start of the scanline and get the bitplanes to hold the colours, you can get smaller pixel sizes.
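As a rough illustration of point 1, here is a minimal Python sketch of the incremental approach. It is not the code from this thread. The 8.8 fixed-point format, the function names and the power-of-two square texture are all assumptions made for the example. sin/cos and multiplies appear only in the per-frame setup, and the inner loop does two adds per pixel.

```python
import math

FP = 8  # fractional bits; the 8.8 fixed-point format is an assumption of this sketch


def frame_steps(angle, zoom):
    """Per-frame setup: the only place where sin/cos are evaluated."""
    dudx = int(round(math.cos(angle) * zoom * (1 << FP)))
    dvdx = int(round(math.sin(angle) * zoom * (1 << FP)))
    # The y step is the x step rotated by 90 degrees.
    return dudx, dvdx, -dvdx, dudx


def rotozoom_incremental(texture, tex_size, width, height, angle, zoom):
    """Walk the texture using only additions inside the pixel loop."""
    dudx, dvdx, dudy, dvdy = frame_steps(angle, zoom)
    mask = tex_size - 1          # tex_size must be a power of two
    out = []
    row_u = row_v = 0            # texture coordinate at the start of each row
    for _ in range(height):
        u, v = row_u, row_v
        row = []
        for _ in range(width):
            row.append(texture[((v >> FP) & mask) * tex_size + ((u >> FP) & mask)])
            u += dudx            # two adds per pixel, no multiply
            v += dvdx
        out.append(row)
        row_u += dudy
        row_v += dvdy
    return out


def rotozoom_direct(texture, tex_size, width, height, angle, zoom):
    """Reference version: recomputes u and v with multiplies for every pixel."""
    dudx, dvdx, dudy, dvdy = frame_steps(angle, zoom)
    mask = tex_size - 1
    return [[texture[(((x * dvdx + y * dvdy) >> FP) & mask) * tex_size
                     + (((x * dudx + y * dudy) >> FP) & mask)]
             for x in range(width)]
            for y in range(height)]
```

Both functions return the same pixels; the incremental one simply moves the multiplies out of the loop, which is the change suggested above.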
|
winden
Member |
balrog/llfb I assume?
something for when you advance some more... rotozoom can be said to be this formula:
u(x,y) = x * dudx + y * dudy
v(x,y) = x * dvdx + y * dvdy
please notice that:
u(x+1,y) - u(x,y) =
((x+1) * dudx + y * dudy) - (x * dudx + y * dudy) =
(dudx + x * dudx + y * dudy) - (x * dudx + y * dudy) =
dudx
so, when you move from (x,y) to (x+1,y), your UV values change linearly and the update can be done with just an add.
rearranging the above math, you can also do this trick:
a[x] = x * dudx + M * x * dvdx
b[y] = y * dudy + M * y * dvdy
and then uv(x,y) = a[x] + b[y]
lots of things elided of course, but with time you can really heavily optimise the routine :)
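To make the a[x] + b[y] trick concrete, here is a small Python sketch. It assumes M = 65536, so that v sits in the upper 16 bits and u in the lower 16 bits of a 32-bit value, with both in 8.8 fixed point. These choices and the function names are assumptions for illustration, not something stated in the post. One of the elided details shows up immediately: when u overflows or goes negative, its carry or borrow leaks into the lowest fractional bit of v.

```python
M = 1 << 16            # assumed multiplier: v in the upper 16 bits, u in the lower 16
MASK32 = 0xFFFFFFFF    # emulate 32-bit register wrap-around


def build_tables(width, height, dudx, dudy, dvdx, dvdy):
    """Precompute one packed table per axis (once per frame)."""
    a = [(x * dudx + M * x * dvdx) & MASK32 for x in range(width)]
    b = [(y * dudy + M * y * dvdy) & MASK32 for y in range(height)]
    return a, b


def lookup(a, b, x, y):
    """One add gives both coordinates; unpack u and v from the 32-bit sum."""
    uv = (a[x] + b[y]) & MASK32
    u = uv & 0xFFFF    # exact: u modulo 2^16
    v = uv >> 16       # v plus whatever carry or borrow came out of the u field
    return u, v
```

In this layout u is always exact, while v is off by the number of times u has wrapped past 16 bits, counted in units of 1/256 texel. That is usually invisible, but it is the kind of thing a real routine has to account for.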
|
BalrogSoft
Member |
Thanks a lot to all. I made a new version and uploaded it under the same names.
z5_ wrote: "Wow, this is cool. Congratulations! Isn't a rotozoomer quite a difficult effect to code? I ran it in WinUAE and it looked nice. Damn, I never got that far."
Thanks a lot! Well, a rotozoomer is a very easy effect, and chunky-to-copper is the easy way to use chunky screens on the Amiga, so I started with an easy demoscene effect. Obviously it depends on your programming skills. I'm a professional programmer, but I use high-level languages at work. If you have experience in other languages, it's a good idea to design the effect in a high-level language first and then port it to assembler. I spent the last 2 years programming games for mobile phones, which usually requires a lot of optimization (runtime and JAR size) and fixed-point math, and that also helps when coding for the Amiga.
Toffeeman wrote: "First, please remember I haven't coded anything in 68k assembler for 12 years! Can anyone get me a copy of Devpac 3? I did notice a few things from a quick look at the code, though.
1. The main reason for the low speed is the call to the GetUVTexture function, as it calculates sin+cos for every pixel you draw. This needs a multiply instruction, which on the 68000 takes something like 45 clock cycles (you will need to look up the exact instruction timings). To speed this up you really need to change your algorithm so that the only thing you do for each pixel is add an x step and a y step to your x and y texture co-ordinates. Basically, at the start of the frame you calculate the 4 corner points of the rotated rectangle, and from those you calculate the steps you need to walk through the rotated texture in x and y."
Thanks a lot, I implemented this routine, and now it runs at a decent speed on a plain A1200.
winden wrote: "balrog/llfb I assume?
something for when you advance some more... rotozoom can be said to be this formula:
u(x,y) = x * dudx + y * dudy
v(x,y) = x * dvdx + y * dvdy"
Yes, I was balrog/llfb... but in those days I used AMOS and Blitz Basic more than assembler. And yes, I used these formulas for my new routine, thanks a lot.
|
sp_
Member |
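If you followed the interpolation algorithm outlined above, you should end up with an inner loop that looks something like this:
; 6-instruction pixel mapper
.loop
    move.w d0,d5            ; d5.w = low word of d0
    move.b d1,d5            ; low byte of d5 = low byte of d1
    addx.l d2,d0            ; d0 += d2 + X (carry from the previous addx)
    addx.l d3,d1            ; d1 += d3 + X
    move.b (a0,d5.l),(a1)+  ; copy the texel at a0+d5 to the destination
    dbf d7,.loop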
To speed things up further you could use self-modifying code (SMC).
Interpolate only every 16th pixel and precalculate the 15 texture offsets in between directly into the code (once per frame).
; Self-modifying code loop, 16 pixels per pass: 22 instructions / 16 pixels = 1.375 instructions per pixel
moveq.l #0,d5
.loop16
lea (a0,d5.l),a6
move.b (a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.b 0000(a6),(a1)+
move.w d0,d5
move.b 0000(a6),(a1)+
move.b d1,d5
move.b 0000(a6),(a1)+
addx.l d2,d0
move.b 0000(a6),(a1)+
addx.l d3,d1
move.b 0000(a6),(a1)+
dbf d7,.loop16
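For readers who find the self-modifying loop hard to follow, here is a rough Python analogue. It is only a sketch: the offset table stands in for the 0000 displacements patched into the code, and the texture layout (256 texels wide, offset = v * 256 + u), the 8.8 fixed-point format and the function names are assumptions that do not necessarily match the register layout used above.

```python
def build_offsets(dudx, dvdx, span=16, tex_width=256, frac=8):
    """Once per frame: texel offsets of pixels 0..span-1 relative to the span start.

    These play the role of the displacements written into the move.b
    instructions by the self-modifying setup code.
    """
    return [((i * dvdx) >> frac) * tex_width + ((i * dudx) >> frac)
            for i in range(span)]


def draw_row(texture, u, v, dudx, dvdx, offsets, spans, tex_width=256, frac=8):
    """Interpolate u,v only once per span; the pixels in between reuse the table."""
    out = []
    span = len(offsets)
    for _ in range(spans):
        base = (v >> frac) * tex_width + (u >> frac)   # like lea (a0,d5.l),a6
        for off in offsets:                            # like move.b off(a6),(a1)+
            out.append(texture[(base + off) % len(texture)])
        u += span * dudx
        v += span * dvdx
    return out
```

Because the fractional part of u and v at the start of each span is ignored, individual texels can be one position off compared with exact per-pixel stepping. The result is exact only when each span happens to start on a whole texel.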
More about inner loop optimizing in this thread:
http://ada.untergrund.net/forum/index.php?action=vthread&forum=4&topic=190
|
sp_
Member |
I have helped to optimize.
Optimized source:
Link to sourcecode
New inner loop (can be made faster):
.NextX:
    add.l d5,d7 ; xx00YYyy
    addx.l d4,d6 ; 000000XX
    move.w d7,d0
    move.b d6,d0
    move.w (a5,d0.w*2),(a0) ; Write texture color into the copper list
    adda.l #4,a0 ; Skip to the next copper MOVE (4 bytes)
    dbf d2,.NextX
|
Toffeeman
Member |
LOL, I finally see how the self-modifying code mapper works now. Do you need to flush the cache manually once you've modified the code? Otherwise wouldn't the cache still hold the original code? Or does the cache know it has changed and reload it?
I guess you would have a move.b 0000(a6),(a1)+ for every pixel you need to write along the x axis. So for a 320*200 screen you would have 320 of them, fill them in at the start of the effect, and call the routine again for every line.
Great idea, try doing that in C :0)
|
winden
Member |
Toffee, you have to flush the cache manually on both m68k and PPC machines; there is no automatic clean-up like on x86.
|
Toffeeman
Member |
Thanks for the info, winden. Do you need this method (self-modifying code) to do full-screen copper chunky rotation, say 3*3, on a stock A1200 at full frame rate? For example, the copper chunky rotator by Gengis in Complex Origin seemed to be a very good example.
|
rload
Member |
Exactly how does one mark an area of memory as non-cacheable using the MMU ??
|
winden
Member |
See pages 4-12 and 4-13 of the 68060 User's Manual...
CM field in bits 6 and 5:
00 = cacheable, writethrough
01 = cacheable, copyback (default for Amiga fast mem, I think)
10 = no cache, precise exceptions
11 = no cache, imprecise exceptions (default for Amiga chip mem, I think)
But maybe it would be better to do it with the transparent translation registers (page 4-6), which can cover a big block of memory and don't slow down processing with MMU descriptor misses.
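The bit positions above can be illustrated with a few lines of Python. This sketch only shows the bit manipulation on a descriptor value; the helper names are made up for the example. Actually changing the cache mode on a real machine means editing MMU tables or translation registers in supervisor mode, which is not shown here.

```python
CM_SHIFT = 5       # CM field occupies bits 6 and 5, as listed above
CM_MASK = 0b11

CACHE_MODES = {
    0b00: "cacheable, writethrough",
    0b01: "cacheable, copyback",
    0b10: "no cache, precise",
    0b11: "no cache, imprecise",
}


def get_cache_mode(descriptor):
    """Extract the 2-bit CM field from a descriptor value."""
    return (descriptor >> CM_SHIFT) & CM_MASK


def set_cache_mode(descriptor, mode):
    """Return the descriptor with its CM field replaced by `mode`."""
    return (descriptor & ~(CM_MASK << CM_SHIFT)) | ((mode & CM_MASK) << CM_SHIFT)
```

For example, set_cache_mode(0, 0b10) gives 0x40, i.e. only bit 6 set.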
|
BalrogSoft
Member |
Thanks again for your help! I will take a look at the changes made to the code, but I think self-modifying code is too much for me at the moment.
|
z5_
Member |
BalrogSoft wrote: "Thanks again for your help! I will take a look at the changes made to the code, but I think self-modifying code is too much for me at the moment."
Indeed. Keep in mind that coders like winden, sp, Loaderror and nearly all the others replying here are already very experienced. So it might be a good idea to take it step by step and leave the most hardcore optimising till later :o)
|
sp_
Member |
Just to complete the SMC tutorial, here is an SMC version which interpolates every 8th pixel and uses SMC to fill in the rest. It runs at 50fps in 320*256 (8x8) on a plain A500. A faster version would be what Toffeeman suggested: unroll the x loop and precalculate one move per x coordinate.
There is no cache clearing in this code. If you want to run it in WinUAE, switch off the JIT setting in the CPU menu.
URL
|
winden
Member |
Balrog, don't worry, go at your own pace... if you continue there will come a moment when going into SMC is the natural way to finish optimising the routine :)
|
BalrogSoft
Member |
Another question: I'm trying to get 4x1 pixels using an AGA copper list, but I can't get it working. I read old newsgroup messages about the AGA copper. I added some instructions and set the FMODE and BPLCON3 registers, but it doesn't work. I have seen a lot of code that works, but I can't extract the information I need to change my own code, and I have some questions after looking at the different examples.
The examples I tested create a copper list with 7 bitplanes and change colors in different color banks. But a chunky-to-copper routine works by setting color0, so why does AGA chunky-to-copper change other colors?
What needs to be added to a copper list to make it an AGA copper list with 4x1 pixels? I swapped the high and low words of the copper list colors, but I still got 8x1 pixels. Do I need to add 7 empty bitplanes?
I found old messages on Google Groups explaining chunky-to-copper tricks, but I can't get it working. I read that you can get 4x1 pixels with ECS and 2x1 with AGA (Alien Breed 3D uses a copper list and has 2x2 pixels). Can somebody explain this trick better?
http://groups.google.es/group/comp.sys.amiga.programmer/browse_thread/thread/ec4c0c74850913fe/06f1e2632d149d7d?lnk=st&q=12bit+aga+copper+list&rnum=4&hl=es#06f1e2632d149d7d
http://groups.google.es/group/comp.sys.amiga.programmer/browse_frm/thread/89345817f7afa85f/7a49dda71191243a?lnk=st&q=aga+copper+list+fmode&rnum=2&hl=es#7a49dda71191243a
|
Toffeeman
Member |
So many people seem to think the copper can change colour every 4 pixels, but it's 8 and has always been 8. The AGA copper is exactly the same as the one in the original Amiga. It really pisses me off, but the 32-bit copper was in the AAA chipset, which also had a move multiple. Mind you, AAA had a true colour 16-bit mode anyway!
Here's an explanation from the man Kalms, from a previous answer to my question about blitting into the hardware registers.
* The original A500 CPU clock (7.14MHz) corresponds to one clock cycle = one lowres pixel
* In both OCS & AGA machines, the chipbus is clocked at half the original A500 CPU clock speed => one chipbus cycle = two lowres pixels
* In both OCS & AGA machines, the blitter can utilize only every other chipbus cycle => one blitter access = four lowres pixels
* To update one colour00 register, one read and one write must be performed => one colour update = eight lowres pixels
So there you go. Setting the FMODE register only affects the display hardware (including sprites). It's not going to make the copper any faster, although it will give you more DMA time for the blitter etc.
So to get smaller pixels you have to use the bitplanes to hold the colours while the copper changes the other colours over 2 scanlines. You can't just use colour 0 for anything smaller than 8 lowres pixels. So you are going to need to do it the way the example code you downloaded does it.
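The arithmetic in Kalms' list can be written out as a tiny Python sketch. The function names are made up for the example, and it only restates the figures quoted above: one chipbus cycle is two lowres pixels, only every other chipbus cycle is usable, and a colour update needs two accesses.

```python
def pixels_per_colour_write():
    """Lowres pixels that pass while one colour register is updated."""
    pixels_per_chipbus_cycle = 2                           # chipbus runs at half the pixel clock
    pixels_per_access = 2 * pixels_per_chipbus_cycle       # only every other bus cycle is usable
    accesses_per_update = 2                                # two accesses per colour update
    return pixels_per_access * accesses_per_update


def max_colour_writes(line_width_pixels):
    """Upper bound on colour changes across a visible line of the given width."""
    return line_width_pixels // pixels_per_colour_write()
```

With these figures a 320-pixel lowres line allows at most 40 colour changes, which is where the 8-pixel-wide copper chunky pixels come from.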
BTW, Dr Skull found that by changing the Amiga's display setting registers you could get more copper moves into a scanline.
|
Kalms
Member |
Note that the aim of the above discussion about blitting into color register is to show that blitting will not run any faster than just having a static copperlist.
The static copperlist is also limited to one color write per 8 cycles, because the copper uses the same bus cycles as the blitter, and it needs to read two words per MOVE instruction.
And regarding the "more copper moves per scanline"...
This is the crucial correlation:
time spent per scanline * number of scanlines * display refresh (Hz) = 1 second
When you switch from 50 to 60Hz, you increase display refresh by 20% and decrease number of scanlines to compensate for that.
The "more copper moves per scanline" tweak would increase the time spent on each scanline, and trade that off either by having fewer scanlines per frame or by having a lower display refresh rate.
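The correlation above can be checked with a one-line calculation. This is only a sketch: the function name is made up, and it assumes a nominal scanline time of 64 microseconds.

```python
def scanlines_per_frame(line_time_s, refresh_hz):
    """Solve line_time * scanlines * refresh = 1 second for the scanline count."""
    return 1.0 / (line_time_s * refresh_hz)


# Nominal 64 microsecond lines at 50 Hz and 60 Hz, then a stretched 70 microsecond line at 50 Hz.
for line_time, refresh in ((64e-6, 50), (64e-6, 60), (70e-6, 50)):
    print(f"{line_time * 1e6:.0f} us, {refresh} Hz: "
          f"{scanlines_per_frame(line_time, refresh):.1f} scanlines")
```

Stretching the line from 64 to 70 microseconds at 50 Hz costs roughly 27 scanlines per frame, which is the trade-off described above.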
But TVs don't accept any sort of input signal; rather, they expect the signal to conform to a few standards. If the signal is slightly out of whack, the TV auto-corrects it.
The reason why TVs do this auto-correction, is that it is common that an air-bound TV signal gets slightly distorted due to the signal transmission bouncing against a lot of objects during its journey from the transmission tower to your TV antenna. The scanline length has the highest resolution and therefore a TV usually includes a stabilizer on that signal. There's no need to do stabilization on the vertical refresh rate though, because the variations will be relatively small compared to the refresh rate.
Specifically, both standard 50Hz and 60Hz modes have 64 microseconds per scanline. TVs expect this. If you fudge and increase the time, the picture on the TV will look OK... up until a certain point. When you reach that point, the TV will no longer draw the rasterlines on the screen in the correct positions. What usually happens is that lines start randomly jumping rightward on the screen, the image gets darker, etc.
Some equipment that works in the digital domain (such as video projectors) will not like non-standard signals AT ALL. I suspect that some TFT screens could run into trouble when fed a PAL/NTSC signal with nonstandard timing too. And how about WinUAE? (I haven't tested myself)
Therefore I recommend that you stick to the predefined resolutions - not doing so limits your potential audience, plus you run the risk of not being able to participate in a compo because the compo equipment is unable to show your demo.
|
rload
Member |
@winden thanks for the info!
|