A.D.A. Amiga Demoscene Archive

Optimizing a doomclone innerloop (fears 1995)

 

sp_
Member
#1 - Posted: 28 Nov 2008 05:19 - Edited
Motion by Bomb (3rd place at TP94) uses a 3x2 copperchunky in the last Doom effect. This routine became the game "Fears" in 1995. Fears is the first Doom clone for the Amiga, developed by Bomb.

Here is a YouTube clip of the game engine running on a CD32 (Amiga 1200, 14 MHz, 2 MB chip RAM):
http://www.youtube.com/watch?v=mbnif3G1wyU

Innerloop:

.loop
move.w d1,d4 ;2
and.w d2,d4 ;2
move.w (a4,d4.w),(a6) ;7
adda.w d0,a6 ;2
add.w d3,d1 ;2
dbf d5,.loop ;3

18 cycles + AGU stall per pixel.

The doom sequence in "Motion" (TP 1994 3rd place demo) innerloop by the same group:

.loop
move.l d4,d1 ;2
swap d1 ;4
and.l d2,d1 ;2
move.w (a4,d1.w),(a6) ;7
add.l d3,d4 ;2
lea $184(a6),a6 ;4
dbf d5,.loop ;3

24 cycles + AGU stall per pixel.


SP SMC Optimized:
.loop8
move.w (a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2

add.l d3,d4 ;2
and.l d2,d4 ;2
lea (a5,d4.w),a4 ;6
dbf d5,.loop8 ;3

77 cycles for 8 pixels: about 9.6 cycles per pixel.

Almost twice the speed... With some work it could run smoothly on a standard A1200!
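
To make the comparison easy to re-check, here is a small Python sketch that just sums the per-instruction cycle comments from the three loops above. The list and function names are made up for illustration, and the model ignores AGU stalls and chip memory wait states, so it only reproduces the paper numbers.

```python
# Per-instruction cycle counts copied from the comments in the three loops above.
fears_loop = [2, 2, 7, 2, 2, 3]        # Fears: one pixel per iteration
motion_loop = [2, 4, 2, 7, 2, 4, 3]    # Motion: one pixel per iteration
smc_loop = [6, 2] * 8 + [2, 2, 6, 3]   # SMC version: eight pixels per iteration


def cycles_per_pixel(cycles, pixels):
    """Sum the listed cycle counts and divide by the pixels drawn per iteration."""
    return sum(cycles) / pixels


fears = cycles_per_pixel(fears_loop, 1)
motion = cycles_per_pixel(motion_loop, 1)
smc = cycles_per_pixel(smc_loop, 8)
speedup = fears / smc  # how much faster the SMC loop is than the Fears loop

print(fears, motion, smc, round(speedup, 2))
```

On paper the unrolled loop is a bit under twice as fast as the Fears loop; real hardware will differ because of the stalls left out here.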
dalton
Member
#2 - Posted: 29 Nov 2008 23:26
you should release a patch for the game =) it looks painfully slow!
sp_
Member
#3 - Posted: 30 Nov 2008 05:54 - Edited
I am searching for the fastest Doom clone on the standard A1200/A500.

Fears used up to 2 longword divs and 2 muls per polygon scanline. The divs can be replaced by table reads, and the muls can be word size.


Now over to more famous games:


Gloom Deluxe wall renderer innerloop, 1 pixel (the exe contains many different innerloops for transparency, floor etc.):

.loop
move.b (a3,d2.w),d5
move.b (a4,d5.w),(a1) ;shadetable access
addx.l d3,d2
add.l d0,a1
dbf d4,.loop

The c2p I found in the exe was not the fastest possible.


Alien Breed 3D wall renderer (the exe contains many different innerloops for transparency, floor etc.):

.loop
move.w (a4,d1.w*2),d3 ;Read texture
blt.b .skip
move.w (a5,d3.w*2),d3 ;transparency texture shadetable??
move.w (a2,d3.w*2),(a6) ;shadetable??
.skip add.w #$1a0,a6
addx.l d2,d1
dbf d4,.loop

This is perhaps not the correct innerloop, as it uses 2 shadetable lookups. The add immediate is slower than a register add.

.

Both innerloops use per-pixel shading with a table lookup per pixel. A faster approach would be to use a mip-map renderer with multiple shaded textures in memory. The shaded textures must be generated on a 256x256 picture.
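
A rough Python model of that trade-off, not taken from either game: the shade formula, the 8 light levels and the 64-texel column are all assumptions chosen for illustration. It shows that a texture copy shaded once per light level gives the same pixels as the per-pixel shadetable lookup, with one read per pixel instead of two.

```python
SHADES = 8      # assumed number of light levels
TEX_SIZE = 64   # assumed texels in one texture column


def make_shade_table(shades, colors=256):
    # Hypothetical shading: scale the colour value down as the level gets darker.
    return [[(c * (shades - s)) // shades for c in range(colors)]
            for s in range(shades)]


def draw_column_lookup(texture, shade_row, count):
    # Two reads per pixel: the texel, then the shadetable (like the loops above).
    return [shade_row[texture[i % len(texture)]] for i in range(count)]


def preshade(texture, shade_table):
    # Build one already-shaded copy of the texture per light level, once.
    return [[row[t] for t in texture] for row in shade_table]


def draw_column_preshaded(shaded_texture, count):
    # One read per pixel: the texel is already shaded.
    return [shaded_texture[i % len(shaded_texture)] for i in range(count)]


texture = [(i * 37) % 256 for i in range(TEX_SIZE)]  # dummy texel data
table = make_shade_table(SHADES)
shaded_copies = preshade(texture, table)
```

The cost is memory: every texture is stored once per light level, which is why combining it with smaller mip-map levels matters on a 2 MB machine.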

Another speedup could be to render two (or more) horizontal pixels in the same loop using two (or more) interpolation registers.


Gloom Deluxe speedup: interpolation for every 4th pixel, 2 interpolations.

.smcloop
move.w (a3),d0 ;3
move.b (a4),d0 ;3
move.w 0000(a3),d1 ;3
move.b 0000(a4),d1 ;3
move.w 0000(a3),d2 ;3
move.b 0000(a4),d2 ;3
move.w 0000(a3),d3 ;3
move.b 0000(a4),d3 ;3

move.w d0,(a1) ;3
add.l a0,a1 ;2
move.w d1,(a1) ;3
add.l a0,a1 ;2
move.w d2,(a1) ;3
add.l a0,a1 ;2
move.w d3,(a1) ;3
add.l a0,a1 ;2

addx.l d4,d5 ;2
addx.l d6,d7 ;2

add.w d5,a3 ;2
add.w d7,a4 ;2

subq.l #1,a1 ;2
bne.b .smcloop ;2

56 cycles / 8 pixels = 7 cycles per pixel. (Gloom's innerloop is 18 cycles per pixel.)

The Alien Breed loop would look similar.

I think most Doom clones on the Amiga can be doubled in speed with some work... (on a low-end Amiga)
sp_
Member
#4 - Posted: 30 Nov 2008 10:15 - Edited
;Fast wall render innerloop for Amiga 500/A1200 to copperchunky 12 bit
;The loop will plot 32 pixels, 4 pixels wide and 8 pixels high.

.loop
move.w (a0),(a4)
move.w (a1),4(a4)
move.w (a2),8(a4)
move.w (a3),12(a4)
add.l a5,a4
REPT 7
move.w 0000(a0),(a4)
move.w 0000(a1),4(a4)
move.w 0000(a2),8(a4)
move.w 0000(a3),12(a4)
add.l a5,a4
ENDR

addx.l d4,d5
addx.l d5,d6
addx.l d0,d1
addx.l d2,d3
add.w d5,a0
add.w d6,a1
add.w d1,a2
add.w d3,a3

subq.l #1,a6
bne.b .loop


MC68000:
632 cycles / 32 pixels = 19.75 cycles per pixel

MC68020:
228 cycles / 32 pixels = 7.125 cycles per pixel
sp_
Member
#5 - Posted: 30 Nov 2008 10:31
On 020+, 6 adds can be removed with the loop below, resulting in 7.0625 cycles per pixel:

.loop
move.w (a0),(a4)
move.w (a1),4(a4)
move.w (a2),8(a4)
move.w (a3),12(a4)
move.w 0000(a0),width(a4)
move.w 0000(a1),width+4(a4)
move.w 0000(a2),width+8(a4)
move.w 0000(a3),width+12(a4)
move.w 0000(a0),width*2(a4)
move.w 0000(a1),width*2+4(a4)
move.w 0000(a2),width*2+8(a4)
move.w 0000(a3),width*2+12(a4)
...

add.l a5,a4

addx.l d4,d5
addx.l d5,d6
addx.l d0,d1
addx.l d2,d3
add.w d5,a0
add.w d6,a1
add.w d1,a2
add.w d3,a3

subq.l #1,a6
bne.b .loop
Azure
Member
#6 - Posted: 30 Nov 2008 15:23
I think you are just developing the valuable insight that the people who spend time to squeeze out every bit out of innerloops are usually not the same people who get things done :)
Azure
Member
#7 - Posted: 30 Nov 2008 15:24
AFAIK Wolfenstein3d used SMC with an unrolled inner loop, very similar to what you are proposing.
sp_
Member
#8 - Posted: 2 Dec 2008 08:29 - Edited
When the Doom clones started to appear on the Amiga in 1995 it was a revolution. Few people thought it was possible on the standard 14 MHz 020. After 1995 most Amiga users moved to faster processors, and optimizing for the old hardware wasn't important anymore. Self-modifying code is not new. On the C64 I think most demo coders use it, and on the Atari it has been used for a long time. The rotozoomer in Chaos Land by VD (1993) uses SMC, and probably some other old Amiga 500 intros/demos do too.

I understand the Alien Breed innerloop now. Textures are not 16 bit in memory, but 8 bit with a colormap. This doubles the number of textures that fit in memory, but slows the renderer by one memory move per pixel. With only 2 MB of chip RAM to play with, this was a fast way to get more textures. The shadetable is 4096*2 bytes long, probably with 16 or 8 shades, so up to 4096*2*16 bytes in memory.

.loop
move.b (a4,d1.w),d3 ;Read texture
blt.b .skip
move.w (a5,d3.w*2),d3 ;convert to 12 bit
move.w (a2,d3.w*2),(a6) ;shadetable and plot
.skip
add.w #$1a0,a6
addx.l d2,d1
dbf d4,.loop
Crumb
Member
#9 - Posted: 2 Dec 2008 18:24
sp_
Member
#10 - Posted: 3 Dec 2008 13:26 - Edited
I have downloaded the Alien Breed 3D II source code and it's a mess. I tried to find the innerloops by searching for the addx instruction but ended up finding a horribly slow byte-move-to-chipmem c2p. The game is very nice in WinUAE and runs fullscreen at 50 fps, so why bother? :D

Most of the games you listed here are made for upgraded Amigas. I am searching for a fast Doom clone on the A500 or standard A1200. When Doom was ported to the Amiga, I remember we had some discussion somewhere on how to improve the speed: innerloops and c2p. This took place on IRC or in some newsgroups, I don't remember. On 060 the original Doom is faster than all its clones (if I remember correctly).

This is the result of the power of opensource and skilled democoders with years of experience.
Crumb
Member
#11 - Posted: 3 Dec 2008 22:14
Death Mask, Space Hulk and Ambermoon run on a standard A500. AFAIK Ambermoon was more or less optimized. Perhaps it would be worth a look.

Testament requires a standard A1200.


But you are 100% right with Breathless, NemacIV and Genetic Species as these games will require an accelerator to work smoothly.


BTW, hats off for your work optimizing innerloops :-)

Is it possible to write self-modifying code that works correctly with 040/060? Have you tried to create self-modifying code for 040/060 that fits inside the cache and runs fast?
sp_
Member
#12 - Posted: 9 Dec 2008 03:00 - Edited
On 040/060 self-modifying code won't give a speedup. The CPU is able to execute instructions while writing to slow memory (pipelining). An MC68030 clocked at 50 MHz is able to do 6 cycles for free while writing a longword to fastmem.


.loop
move.l d1,(a0)+
add.l d1,d2
add.l d3,d4
add.l d5,d6
dbf d7,.loop

is as fast as

.loop
move.l d1,(a0)+
dbf d7,.loop


On 060 two instructions can be executed at the same time if they don't share registers. I don't remember how many cycles it's possible to get for free after a fastmem write, but I assume it's at least 6. This means 12 instructions for free if they are pipelined in pairs.

.loop
move.l d1,(a0)+
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
...

dbf d7,.loop

.
To optimize on the 040/060 it helps to improve cache hits. If the textures are flipped in memory (X and Y axes swapped), a vertical wall renderer will more likely read cached texture data. Plotting more than one horizontal pixel per loop, as suggested in the optimized loops above, will also increase cache hits. Using mip-maps (smaller textures for smaller polygons) might help too.
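
A small Python model of the texture-flip idea, under assumptions that are mine and not measured on hardware: 1 byte per texel, a 64x64 texture starting on a cache-line boundary, and a 16-byte cache line. It counts how many distinct cache lines one vertical texture column touches in each layout.

```python
CACHE_LINE = 16   # assumed cache line size in bytes
TEX_W = 64        # assumed texture width in texels (1 byte per texel)
TEX_H = 64        # assumed texture height in texels


def lines_touched(addresses, line_size=CACHE_LINE):
    """Number of distinct cache lines covered by a list of byte addresses."""
    return len({a // line_size for a in addresses})


def column_row_major(x, height=TEX_H, width=TEX_W):
    # Normal layout: stepping down a column jumps one full row per texel.
    return [y * width + x for y in range(height)]


def column_transposed(x, height=TEX_H):
    # Flipped layout (X and Y swapped): a column is contiguous in memory.
    return [x * height + y for y in range(height)]


normal = lines_touched(column_row_major(5))
flipped = lines_touched(column_transposed(5))
print(normal, flipped)
```

In this model a full unscaled column touches one cache line per texel in the normal layout, and only a quarter of a line per 4 texels in the flipped one; how much of that shows up as real speed depends on scaling and on what else competes for the cache.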

The MC68060 doesn't implement the 64-bit forms of divs.l and muls.l (64-bit dividend or 64-bit product) in hardware, so these instructions have to be emulated. A word-size divs is 32 bit / 16 bit and should be enough for calculating interpolations. A muls (2 cycles) is much faster than a divs. If you divide by less than 256 you can use <<8 fixed point and muls.w: number/n = (1/n) * number.
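
The number/n = (1/n) * number trick can be sketched in Python as a reciprocal table. This is only a model: it uses 16 fractional bits instead of the <<8 mentioned above to keep the error small, the names are invented, and it ignores the 16-bit signed operand limits that muls.w would impose on real hardware.

```python
FRAC_BITS = 16  # assumption: 16 fractional bits rather than 8

# RECIP[n] is roughly 65536 / n for divisors 1..255; entry 0 is unused.
RECIP = [0] + [(1 << FRAC_BITS) // n for n in range(1, 256)]


def div_by_table(number, n):
    """Approximate number // n with one multiply and one shift."""
    return (number * RECIP[n]) >> FRAC_BITS


exact = 1000 // 10
approx = div_by_table(1000, 10)
print(exact, approx)
```

Because the table entries are rounded down, the result is either exact or one too small for the non-negative inputs tried here, which is usually acceptable for interpolation steps.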
dalton
Member
#13 - Posted: 9 Dec 2008 08:37
The main problem with SMC on cache-based CPUs is that the instruction cache isn't writeable. To change an instruction you'd have to write it to the data cache first, which will then update main memory (eventually), and there's still no detection mechanism to mirror main memory into the instruction cache when main memory is updated. So basically you'd have to flush the instruction cache so that it's all reloaded. Lots of pain. No gain!

Much more interesting to exploit benefits of cache architecture than trying to make smc work!
Azure
Member
#14 - Posted: 9 Dec 2008 21:44
But the 040 and 060 have a writeback cache. With good cache management you should not have a lot of wait states after writing.
sp_
Member
#15 - Posted: 11 Dec 2008 16:37 - Edited
The 060 has a 16-byte cacheline.

I assume that if you render pixels horizontally, the copyback cache will collect 16 bytes in the cache before the data is pushed to memory. This means that you get a wait state only once every 16 pixels (when the cache line is moved to memory).

a1 must be aligned to a 16-byte boundary.
.loop16

move.b (a0,d0.w),(a1)+
swap d0
move.b (a0,d1.w),(a1)+
swap d1
move.b (a0,d2.w),(a1)+
swap d2
move.b (a0,d3.w),(a1)+
swap d3
move.b (a0,d4.w),(a1)+
swap d4
move.b (a0,d5.w),(a1)+
swap d5
move.b (a0,d6.w),(a1)+
swap d6
move.b (a0,d7.w),(a1)+
swap d7
move.b (a0,d0.w),(a1)+
move.b (a0,d1.w),(a1)+
move.b (a0,d2.w),(a1)+
move.b (a0,d3.w),(a1)+
move.b (a0,d4.w),(a1)+
move.b (a0,d5.w),(a1)+
move.b (a0,d6.w),(a1)+
move.b (a0,d7.w),(a1)+ ;The last byte of the cacheline is filled and 16 bytes are moved to memory

...
Interpolation code for 16 pixels is placed here. Probably 0 cycles, since the bus is busy writing to memory.
...
sp_
Member
#16 - Posted: 13 Dec 2008 17:48 - Edited
I found a Wolfenstein port for the Atari ST.
The author claims the game runs at 15 fps on an 8 MHz MC68000 with 2 MB RAM.

If any Atari coders could disassemble it and give me the innerloop, it would be interesting.

http://freenet-homepage.de/ray.tscc/wolf3d.htm
sp_
Member
#17 - Posted: 30 Apr 2011 20:00 - Edited
I am planning to patch AB3D II to use truecolor mode by modifying the shadetables..

Every pixel is rendered through a 8bit shadetable...

move.b (a0,d0.w),(a1)+ ....

How about expanding the shadetables to 24bit :D

To give the game a fresh look..

What do you think?
Blueberry
Member
#18 - Posted: 2 May 2011 12:41
I think it is a waste of good processor cycles (and memory) to have a certain cache miss on every blend operation, when a full alpha blend can be computed in less time. :)

 
