A.D.A. Amiga Demoscene Archive

Forum / Coding / Optimizing a doomclone innerloop (Fears 1995)

sp_ Member	#1 - Posted: 28 Nov 2008 05:19 - Edited

Motion by Bomb took 3rd place at TP94 and uses a 3x2 copperchunky in its final Doom effect. That routine became the game "Fears" in 1995. Fears is the first Doom clone for the Amiga, developed by Bomb.

Here is a YouTube clip of the game engine running on a CD32 (Amiga 1200 class: 14 MHz, 2 MB chip RAM):
http://www.youtube.com/watch?v=mbnif3G1wyU

Fears inner loop:

.loop
    move.w  d1,d4               ;2
    and.w   d2,d4               ;2
    move.w  (a4,d4.w),(a6)      ;7
    adda.w  d0,a6               ;2
    add.w   d3,d1               ;2
    dbf     d5,.loop            ;3

18 cycles + AGU stall per pixel.

The Doom sequence in "Motion" (TP 1994, 3rd place demo), inner loop by the same group:

.loop
    move.l  d4,d1               ;2
    swap    d1                  ;4
    and.l   d2,d1               ;2
    move.w  (a4,d1.w),(a6)      ;7
    add.l   d3,d4               ;2
    lea     $184(a6),a6         ;4
    dbf     d5,.loop            ;3

24 cycles + AGU stall per pixel.

SP's SMC-optimized version (the 0000 displacements are placeholders patched by the self-modifying setup code):

.loop8
    move.w  (a4),(a6)           ;6
    add.l   d6,a6               ;2
    move.w  0000(a4),(a6)       ;6
    add.l   d6,a6               ;2
    move.w  0000(a4),(a6)       ;6
    add.l   d6,a6               ;2
    move.w  0000(a4),(a6)       ;6
    add.l   d6,a6               ;2
    move.w  0000(a4),(a6)       ;6
    add.l   d6,a6               ;2
    move.w  0000(a4),(a6)       ;6
    add.l   d6,a6               ;2
    move.w  0000(a4),(a6)       ;6
    add.l   d6,a6               ;2
    move.w  0000(a4),(a6)       ;6
    add.l   d6,a6               ;2
    add.l   d3,d4               ;2
    and.l   d2,d4               ;2
    lea     (a5,d4.w),a4        ;6
    dbf     d5,.loop8           ;3

77 cycles for 8 pixels, about 9.6 cycles per pixel. Almost twice the speed... With some work it could run smoothly on a standard A1200!

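A rough Python model of the two loop shapes in post #1, offered as a sketch rather than a transcription of the original code. It assumes 16.16 fixed point with the integer part in the high word (as the swap in the Motion loop suggests) and a power-of-two texture length. The names draw_column_per_pixel and draw_column_blocked are made up for illustration. The blocked variant masks every offset, which the assembly does not do (there the displacements are fixed and only the base is masked), so treat that as a simplification.

```python
FRAC_BITS = 16  # 16.16 fixed point: integer texel index in the high word


def draw_column_per_pixel(texture, u, du, count):
    """One masked texture fetch per pixel, stepping u by du each time."""
    mask = len(texture) - 1           # texture length assumed to be a power of two
    out = []
    for _ in range(count):
        out.append(texture[(u >> FRAC_BITS) & mask])
        u = (u + du) & 0xFFFFFFFF     # 32-bit wraparound, like add.l
    return out


def draw_column_blocked(texture, u, du, count, block=8):
    """Blocked variant: offsets computed once, base stepped once per block."""
    mask = len(texture) - 1
    # These play the role of the displacements patched into the unrolled loop.
    offsets = [(i * du) >> FRAC_BITS for i in range(block)]
    out = []
    for _ in range(count // block):   # count is assumed to be a multiple of block
        base = u >> FRAC_BITS
        for off in offsets:
            out.append(texture[(base + off) & mask])
        u = (u + block * du) & 0xFFFFFFFF
    return out
```

With a step of exactly half a texel the two variants agree. In general the blocked form can be off by one texel inside a block, because the fractional part of the base is ignored when the fixed offsets are applied.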
dalton Member	#2 - Posted: 29 Nov 2008 23:26

You should release a patch for the game =) It looks painfully slow!

sp_ Member	#3 - Posted: 30 Nov 2008 05:54 - Edited

I am searching for the fastest Doom clone on the standard A1200/A500. Fears used up to 2 longword divs and 2 muls per polygon scanline. The divs can be replaced by table reads, and the muls can be word size.

Now over to more famous games.

Gloom Deluxe wall render inner loop, 1 pixel (the exe contains many different inner loops for transparency, floor, etc.):

.loop
    move.b  (a3,d2.w),d5
    move.b  (a4,d5.w),(a1)      ; shadetable access
    addx.l  d3,d2
    add.l   d0,a1
    dbf     d4,.loop

The c2p I found in the exe was not the fastest possible.

Wall renderer, Alien Breed 3D (the exe contains many different inner loops for transparency, floor, etc.):

.loop
    move.w  (a4,d1.w*2),d3      ; read texture
    blt.b   .skip
    move.w  (a5,d3.w*2),d3      ; transparency texture shadetable??
    move.w  (a2,d3.w*2),(a6)    ; shadetable??
.skip
    add.w   #$1a0,a6
    addx.l  d2,d1
    dbf     d4,.loop

This is perhaps not the correct inner loop, as it uses 2 shadetable lookups. The add immediate is slower than a register add.

Both inner loops use per-pixel shading with a table lookup per pixel. A faster approach would be to use a mip-map renderer with multiple pre-shaded textures in memory. The shaded textures must be generated from a 256x256 picture. Another speedup could be to render two (or more) horizontal pixels in the same loop using two (or more) interpolation registers.

Gloom Deluxe speedup: interpolation for every 4th pixel, 2 interpolations (the 0000 displacements are patched by the SMC setup):

.smcloop
    move.w  (a3),d0             ;3
    move.b  (a4),d0             ;3
    move.w  0000(a3),d1         ;3
    move.b  0000(a4),d1         ;3
    move.w  0000(a3),d2         ;3
    move.b  0000(a4),d2         ;3
    move.w  0000(a3),d3         ;3
    move.b  0000(a4),d3         ;3
    move.w  d0,(a1)             ;3
    add.l   a0,a1               ;2
    move.w  d1,(a1)             ;3
    add.l   a0,a1               ;2
    move.w  d2,(a1)             ;3
    add.l   a0,a1               ;2
    move.w  d3,(a1)             ;3
    add.l   a0,a1               ;2
    addx.l  d4,d5               ;2
    addx.l  d6,d7               ;2
    add.w   d5,a3               ;2
    add.w   d7,a4               ;2
    subq.l  #1,a1               ;2  counter as posted; a1 is also the destination, and subq on An leaves the flags untouched
    bne.b   .smcloop            ;2

56 cycles / 8 pixels = 7 cycles per pixel. (Gloom's inner loop is 18 cycles per pixel.) The Alien Breed loop would look similar.

I think most Doom clones on the Amiga can be doubled in speed with some work... (on a low-end Amiga)

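The pre-shaded texture idea from post #3 can be sketched in Python as below. This is only an illustration under assumptions: make_shade_table uses a made-up linear darkening rule (a real table would map palette entries), and all function names are hypothetical. The point is that shading each texture once per shade level moves the second table lookup out of the inner loop, at the cost of one texture copy per level.

```python
def make_shade_table(levels, colours=256):
    """Hypothetical shade table: row `level` maps a colour index to a darker one."""
    return [[(c * (level + 1)) // levels for c in range(colours)]
            for level in range(levels)]


def shade_per_pixel(texture, shade_row):
    """What the original loops do: texture lookup, then shade lookup, per pixel."""
    return [shade_row[texel] for texel in texture]


def preshade(texture, shade_table):
    """Build one already-shaded copy of the texture per shade level."""
    return [[row[texel] for texel in texture] for row in shade_table]


def preshaded_bytes(texture_bytes, levels):
    """Memory cost of keeping every shaded copy resident."""
    return texture_bytes * levels
```

Rendering from preshade(...)[level] then needs a single fetch per pixel. For example, a 64x64 8-bit texture kept at 16 shade levels costs preshaded_bytes(64 * 64, 16) bytes, which is the trade-off the post hints at with mip-maps and smaller textures.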
sp_ Member	#4 - Posted: 30 Nov 2008 10:15 - Edited

; Fast wall render inner loop for Amiga 500/A1200 to 12-bit copperchunky
; The loop plots 32 pixels: 4 pixels wide and 8 pixels high.
.loop
    move.w  (a0),(a4)
    move.w  (a1),4(a4)
    move.w  (a2),8(a4)
    move.w  (a3),12(a4)
    add.l   a5,a4
    REPT 7
    move.w  0000(a0),(a4)
    move.w  0000(a1),4(a4)
    move.w  0000(a2),8(a4)
    move.w  0000(a3),12(a4)
    add.l   a5,a4
    ENDR
    addx.l  d4,d5
    addx.l  d5,d6
    addx.l  d0,d1
    addx.l  d2,d3
    add.w   d5,a0
    add.w   d6,a1
    add.w   d1,a2
    add.w   d3,a3
    subq.l  #1,a6               ; counter as posted; subq on An leaves the flags untouched
    bne.b   .loop

MC68000: 632 cycles / 32 pixels = 19.75 cycles per pixel
MC68020: 228 cycles / 32 pixels = 7.125 cycles per pixel

sp_ Member	#5 - Posted: 30 Nov 2008 10:31

On 020+ 6 adds can be removed with the loop below, resulting in 7.0625 cycles per pixel:

.loop
    move.w  (a0),(a4)
    move.w  (a1),4(a4)
    move.w  (a2),8(a4)
    move.w  (a3),12(a4)
    move.w  0000(a0),width(a4)
    move.w  0000(a1),width+4(a4)
    move.w  0000(a2),width+8(a4)
    move.w  0000(a3),width+12(a4)
    move.w  0000(a0),width*2(a4)
    move.w  0000(a1),width*2+4(a4)
    move.w  0000(a2),width*2+8(a4)
    move.w  0000(a3),width*2+12(a4)
    ...
    add.l   a5,a4
    addx.l  d4,d5
    addx.l  d5,d6
    addx.l  d0,d1
    addx.l  d2,d3
    add.w   d5,a0
    add.w   d6,a1
    add.w   d1,a2
    add.w   d3,a3
    subq.l  #1,a6               ; counter as posted; subq on An leaves the flags untouched
    bne.b   .loop

Azure Member	#6 - Posted: 30 Nov 2008 15:23

I think you are just developing the valuable insight that the people who spend time squeezing every last bit out of inner loops are usually not the same people who get things done :)

Azure Member	#7 - Posted: 30 Nov 2008 15:24

AFAIK Wolfenstein 3D used SMC with an unrolled inner loop, very similar to what you are proposing.

sp_ Member	#8 - Posted: 2 Dec 2008 08:29 - Edited

When the Doom clones started to appear on the Amiga in 1995 it was a revolution. Few people thought it was possible on the standard 14 MHz 020. After 1995 most Amiga users moved to faster processors, and optimizing for the old hardware wasn't important anymore.

Self-modifying code is not new. On the C64 I think most demo coders use it, and on the Atari it has been used for a long time. The rotozoomer in Chaos Land by VD (1993) uses SMC, and probably some other old Amiga 500 intros/demos do too.

I understand the Alien Breed inner loop now. Textures are not 16-bit in memory, but 8-bit with a colormap. This doubles the number of textures that fit in memory, but slows the renderer by one memory move per pixel. With only 2 MB of chip RAM to play with, this was a fast way to get more textures. The shade table is 4096*2 bytes long, probably with 16 or 8 shades, so 4096*2*16 bytes in memory.

.loop
    move.b  (a4,d1.w),d3        ; read texture
    blt.b   .skip
    move.w  (a5,d3.w*2),d3      ; convert to 12 bit
    move.w  (a2,d3.w*2),(a6)    ; shadetable and plot
.skip
    add.w   #$1a0,a6
    addx.l  d2,d1
    dbf     d4,.loop

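A small Python sketch of the reading given in post #8, under stated assumptions: the shade table is taken as 4096 entries of 2 bytes per shade level, and texels whose signed byte value is negative (0x80 and up) are skipped, which is what blt.b does directly after a move.b. The function names and the example tables are invented for illustration and are not taken from the game.

```python
PALETTE_ENTRIES = 4096   # 12-bit colours
BYTES_PER_ENTRY = 2      # each entry stored as a word


def shade_table_bytes(shades=16):
    """Memory for all shade levels; the post guesses 16 or 8 levels."""
    return PALETTE_ENTRIES * BYTES_PER_ENTRY * shades


def plot_column(texels, colormap, shade_row, column):
    """8-bit texel -> 12-bit colour via colormap -> shaded colour via shade_row."""
    for y, texel in enumerate(texels):
        if texel >= 0x80:            # negative as a signed byte: branch to .skip
            continue                 # pixel left untouched (treated as transparent)
        column[y] = shade_row[colormap[texel]]
    return column
```

With 16 levels the table comes to shade_table_bytes(16) = 131072 bytes, and half of that with 8 levels.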
Crumb Member	#9 - Posted: 2 Dec 2008 18:24

What about?:

-Genetic Species
http://aminet.net/search?query=genetic+species
-Nemac IV
http://aminet.net/search?query=nemac
-BreathLess
http://aminet.net/search?query=breathless
http://www.lemonamiga.com/games/details.php?id=227
-Testament
http://www.lemonamiga.com/games/details.php?id=2208
-Death Mask
http://www.lemonamiga.com/games/details.php?id=311
-Ambermoon
http://thalion.exotica.org.uk/games/ambermoon/amiga/ambermoone.zip
-Space Hulk
http://www.lemonamiga.com/games/details.php?id=1334
-the real Doom (ADoom, DoomAttack)
http://aminet.net/game/shoot/ADoom-1.3.lha
http://aminet.net/package/game/shoot/DoomAttack
-Quake...

sp_ Member	#10 - Posted: 3 Dec 2008 13:26 - Edited

I have downloaded the Alien Breed 3D II source code and it's a mess. I tried to find the inner loops by searching for the addx instruction, but ended up finding a horribly slow byte-move-to-chipmem c2p. The game is very nice in WinUAE and runs fullscreen at 50 fps, so why bother with it? :D

Most of the games you listed here are made for upgraded Amigas. I am searching for a fast Doom clone on the A500 or standard A1200.

When Doom was ported to the Amiga, I remember we had some discussion somewhere on how to improve the speed: inner loops and c2p. This took place on IRC or in some newsgroups, I don't remember. On 060 the original Doom is faster than all its clones (if I remember correctly). This is the result of the power of open source and skilled demo coders with years of experience.

Crumb Member	#11 - Posted: 3 Dec 2008 22:14

Death Mask, Space Hulk and Ambermoon run on a standard A500. AFAIK Ambermoon was more or less optimized. Perhaps it would be worth a look. Testament requires a standard A1200. But you are 100% right about Breathless, Nemac IV and Genetic Species, as these games require an accelerator to run smoothly.

BTW, hats off for your work optimizing inner loops :-) Is it possible to write self-modifying code that works correctly on the 040/060? Have you tried to create self-modifying code for the 040/060 that fits inside the cache and runs fast?

sp_ Member	#12 - Posted: 9 Dec 2008 03:00 - Edited

On 040/060 self-modifying code won't give a speedup. The CPU is able to execute instructions while writing to slow memory (pipelining). An MC68030 clocked at 50 MHz can do 6 cycles of work for free while writing a longword to fast memory.

.loop
    move.l  d1,(a0)+
    add.l   d1,d2
    add.l   d3,d4
    add.l   d5,d6
    dbf     d7,.loop

is as fast as

.loop
    move.l  d1,(a0)+
    dbf     d7,.loop

On 060 two instructions can be executed at the same time if they don't share registers. I don't remember how many cycles it's possible to get for free after a fastmem write, but I assume it's at least 6. This means 12 instructions for free if they pair up.

.loop
    move.l  d1,(a0)+
    add.l   d0,d0
    add.l   d1,d1
    add.l   d0,d0
    add.l   d1,d1
    add.l   d0,d0
    add.l   d1,d1
    add.l   d0,d0
    add.l   d1,d1
    add.l   d0,d0
    add.l   d1,d1
    add.l   d0,d0
    add.l   d1,d1
    ...
    dbf     d7,.loop

To optimize on the 040/060 it will help to improve cache hits. If the textures are flipped in memory (X and Y axes swapped), a vertical wall renderer is more likely to read cached texture data. Plotting more than one horizontal pixel per loop, as suggested in the optimized loops above, will also increase cache hits. Using mip-maps (smaller textures for smaller polygons) might help too.

The MC68060 doesn't implement the 64-bit forms of divs.l and muls.l in hardware, so these instructions have to be emulated. A word-size divs is 32 bit / 16 bit and should be enough for calculating interpolations. A muls (2 cycles) is much faster than a divs. If you divide by less than 256 you can use <<8 fixed point and muls.w:

number/n = 1/n * number

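The reciprocal trick at the end of post #12 can be sketched in Python as follows. This is an illustrative model, not code from any of the games: RECIP and div_by_table are made-up names, the table holds floor(256/n), and the range check on number stands in for the 16-bit signed operand that muls.w takes.

```python
SHIFT = 8                                   # the "<<8" fixed point from the post
RECIP = [0] + [(1 << SHIFT) // n for n in range(1, 256)]   # RECIP[n] ~ 256/n


def div_by_table(number, n):
    """Approximate number / n with one table read, one multiply and one shift."""
    assert 1 <= n < 256                     # table only covers divisors below 256
    assert -32768 <= number <= 32767        # muls.w operands are 16-bit signed
    return (number * RECIP[n]) >> SHIFT     # arithmetic shift, like asr
```

The result is exact when n is a power of two and otherwise undershoots, because 8 fractional bits are coarse: div_by_table(1000, 7) gives 140 where the true quotient is 142. A wider reciprocal would be more accurate, but the product would no longer be a plain word multiply.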
dalton Member	#13 - Posted: 9 Dec 2008 08:37

The main problem with SMC on cache-based CPUs is that the instruction cache isn't writeable. To change an instruction you'd have to write it to the data cache first, which will then update main memory (eventually), and there's still no detection mechanism to mirror main memory into the instruction cache when main memory is updated. So basically you'd have to flush the instruction cache so that it's all reloaded. Lots of pain. No gain! Much more interesting to exploit the benefits of the cache architecture than to try to make SMC work!

Azure Member	#14 - Posted: 9 Dec 2008 21:44

But the 040 and 060 have a writeback cache. With good cache management you should not have a lot of waitstates after writing.

sp_ Member	#15 - Posted: 11 Dec 2008 16:37 - Edited

The 060 has a 16-byte cache line. I assume that if you render pixels horizontally, the copyback cache will collect 16 bytes before the data is pushed to memory. This means that you get a waitstate only once every 16 pixels (when the cache line is moved to memory).

a1 must be aligned to a 16-byte boundary.

.loop16
    move.b  (a0,d0.w),(a1)+
    swap    d0
    move.b  (a0,d1.w),(a1)+
    swap    d1
    move.b  (a0,d2.w),(a1)+
    swap    d2
    move.b  (a0,d3.w),(a1)+
    swap    d3
    move.b  (a0,d4.w),(a1)+
    swap    d4
    move.b  (a0,d5.w),(a1)+
    swap    d5
    move.b  (a0,d6.w),(a1)+
    swap    d6
    move.b  (a0,d7.w),(a1)+
    swap    d7
    move.b  (a0,d0.w),(a1)+
    move.b  (a0,d1.w),(a1)+
    move.b  (a0,d2.w),(a1)+
    move.b  (a0,d3.w),(a1)+
    move.b  (a0,d4.w),(a1)+
    move.b  (a0,d5.w),(a1)+
    move.b  (a0,d6.w),(a1)+
    move.b  (a0,d7.w),(a1)+
    ; The last byte of the cache line is filled and 16 bytes are moved to memory
    ...
    ; The interpolation code for 16 pixels is placed here. Probably 0 cycles,
    ; since the bus is busy writing to memory.
    ...

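Why the alignment of a1 matters can be shown with a simplified count of how many 16-byte lines a run of pixel writes touches. This Python sketch only counts lines; it does not model the 060's actual copyback behaviour, and lines_touched is a hypothetical helper name.

```python
LINE = 16   # cache line size in bytes, as stated in the post


def lines_touched(addr, nbytes, line=LINE):
    """Number of distinct cache lines covered by nbytes written from addr."""
    first = addr // line
    last = (addr + nbytes - 1) // line
    return last - first + 1
```

A 16-pixel run starting on a 16-byte boundary stays inside one line, so under the post's assumption it costs one push to memory. The same run starting 8 bytes later straddles two lines, and a 320-byte aligned row covers 20.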
sp_ Member	#16 - Posted: 13 Dec 2008 17:48 - Edited

I found a Wolfenstein port for the Atari ST. The author claims the game runs at 15 fps on an 8 MHz MC68000 with 2 MB of RAM. If any Atari coders could disassemble it and give me the inner loop, it would be interesting.
http://freenet-homepage.de/ray.tscc/wolf3d.htm

sp_ Member	#17 - Posted: 30 Apr 2011 20:00 - Edited

I am planning to patch AB3D II to use truecolor mode by modifying the shade tables. Every pixel is rendered through an 8-bit shade table...

    move.b  (a0,d0.w),(a1)+
    ....

How about expanding the shade tables to 24 bit? :D To give the game a fresh look. What do you think?

Blueberry Member	#18 - Posted: 2 May 2011 12:41

I think it is a waste of good processor cycles (and memory) to have a certain cache miss on every blend operation, when a full alpha blend can be computed in less time. :)
