|
Author |
Message |
sp_
Member |
Motion by Bomb (3rd place at TP94) uses a 3x2 copperchunky in its last doom effect. This routine became the game "Fears" in 1995. Fears, developed by Bomb, is the first doomclone for the Amiga.
Here is a YouTube clip of the game engine running on a CD32 (Amiga 1200 hardware: 14 MHz, 2 MB chip).
http://www.youtube.com/watch?v=mbnif3G1wyU
Innerloop:
.loop
move.w d1,d4 ;2
and.w d2,d4 ;2
move.w (a4,d4.w),(a6) ;7
adda.w d0,a6 ;2
add.w d3,d1 ;2
dbf d5,.loop ;3
18 cycles + AGU stall per pixel.
The innerloop of the doom sequence in "Motion" (TP 1994 3rd place demo), by the same group:
.loop
move.l d4,d1 ;2
swap d1 ;4
and.l d2,d1 ;2
move.w (a4,d1.w),(a6) ;7
add.l d3,d4 ;2
lea $184(a6),a6 ;4
dbf d5,.loop ;3
24 cycles + AGU stall per pixel.
SP SMC Optimized:
.loop8
move.w (a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
move.w 0000(a4),(a6) ;6
add.l d6,a6 ;2
add.l d3,d4 ;2
and.l d2,d4 ;2
lea (a5,d4.w),a4 ;6
dbf d5,.loop8 ;3
77 cycles for 8 pixels: about 9.6 cycles per pixel.
Almost twice the speed... With some work it can run smoothly on a standard A1200!
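To make the idea behind the SMC loop easier to follow, here is a rough C model of it. This is only a sketch under assumptions: the function names, the 16.16 fixed-point format and the `off[]` array (which stands in for the patched `0000(a4)` displacements) are made up for illustration and are not taken from Fears or Motion.

```c
#include <stddef.h>
#include <stdint.h>

#define UNROLL 8

/* One step, mask and lookup per pixel, like the original 18-cycle loop.
   pos and step are 16.16 fixed point; mask is texture size - 1. */
void column_plain(const uint16_t *tex, uint32_t mask, uint32_t pos,
                  uint32_t step, uint16_t *dst, size_t stride, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        *dst = tex[(pos >> 16) & mask];
        dst += stride;
        pos += step;
    }
}

/* SMC-style version: the 8 texel offsets depend only on step, so they are
   computed once per column (the asm patches them into the displacements).
   The texture position is stepped and masked once per 8 pixels.
   Assumes count is a multiple of UNROLL and that the texture is padded
   by at least off[UNROLL - 1] texels past mask. */
void column_unrolled(const uint16_t *tex, uint32_t mask, uint32_t pos,
                     uint32_t step, uint16_t *dst, size_t stride, size_t count)
{
    uint32_t off[UNROLL];
    for (int k = 0; k < UNROLL; k++)
        off[k] = ((uint32_t)k * step) >> 16;      /* the "patch" step */

    for (size_t i = 0; i < count; i += UNROLL) {
        const uint16_t *base = tex + ((pos >> 16) & mask);
        for (int k = 0; k < UNROLL; k++) {        /* fully unrolled in the asm */
            *dst = base[off[k]];
            dst += stride;
        }
        pos += step * UNROLL;
    }
}
```

The fractional part of the position is ignored inside each group of 8, so a texel can be off by one compared with the plain loop unless the position has no fraction at the start of the group. That is the price of stepping the texture coordinate only every 8th pixel.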
|
dalton
Member |
you should release a patch for the game =) it looks painfully slow!
|
sp_
Member |
I am searching for the fastest doomclone on the standard a1200/a500.
Fears used up to 2 longword divs and 2 muls per polygon scanline. The divs can be replaced by table reads, and the muls can be word size.
Now over to more famous games:
Gloom Deluxe wall-render innerloop, 1 pixel per iteration (the exe contains many different innerloops for transparency, floor etc.):
.loop
move.b (a3,d2.w),d5
move.b (a4,d5.w),(a1) ;shadetable access
addx.l d3,d2
add.l d0,a1
dbf d4,.loop
The c2p I found in the exe was not the fastest possible.
Alien Breed 3D wall renderer (the exe contains many different innerloops for transparency, floor etc.):
.loop
move.w (a4,d1.w*2),d3 ;Read texture
blt.b .skip
move.w (a5,d3.w*2),d3 ;transparency texture shadetable??
move.w (a2,d3.w*2),(a6) ;shadetable??
.skip add.w #$1a0,a6
addx.l d2,d1
dbf d4,.loop
This is perhaps not the correct innerloop, as it uses 2 shadetable lookups. The add immediate is slower than a register add.
.
Both innerloops use per-pixel shading with one table lookup per pixel. A faster approach would be to use a mip-map renderer with multiple pre-shaded textures in memory. The shaded textures must be generated on a 256x256 picture.
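A small C sketch of the pre-shaded texture idea, assuming 8-bit texels and a flat shade table laid out as `shade[level * 256 + colour]`. The names, the 64x64 texture size and the 8 light levels are hypothetical and only serve the illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TEX_SIZE (64 * 64)   /* hypothetical texture size in texels */
#define SHADES   8           /* hypothetical number of light levels */

/* Done once at load time: one shaded copy of the texture per light level.
   out must hold SHADES * TEX_SIZE bytes. */
void build_shaded_textures(const uint8_t *tex, const uint8_t *shade,
                           uint8_t *out)
{
    for (size_t level = 0; level < SHADES; level++)
        for (size_t i = 0; i < TEX_SIZE; i++)
            out[level * TEX_SIZE + i] = shade[level * 256 + tex[i]];
}

/* Innerloop without a shade lookup: the caller picks the shaded copy
   for the wall's light level and passes it in as shaded_tex. */
void column_preshaded(const uint8_t *shaded_tex, uint32_t pos, uint32_t step,
                      uint8_t *dst, size_t stride, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        *dst = shaded_tex[(pos >> 16) & (TEX_SIZE - 1)];
        dst += stride;
        pos += step;                 /* 16.16 fixed point */
    }
}
```

The trade-off is memory: every texture is stored SHADES times, in exchange for one memory read less per pixel.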
Another speedup could be to render two (or more) horizontal pixels in the same loop using two (or more) interpolation registers.
Gloom Deluxe speedup: interpolation for every 4th pixel, 2 interpolations:
.smcloop
move.w (a3),d0 ;3
move.b (a4),d0 ;3
move.w 0000(a3),d1 ;3
move.b 0000(a4),d1 ;3
move.w 0000(a3),d2 ;3
move.b 0000(a4),d2 ;3
move.w 0000(a3),d3 ;3
move.b 0000(a4),d3 ;3
move.w d0,(a1) ;3
add.l a0,a1 ;2
move.w d1,(a1) ;3
add.l a0,a1 ;2
move.w d2,(a1) ;3
add.l a0,a1 ;2
move.w d3,(a1) ;3
add.l a0,a1 ;2
addx.l d4,d5 ;2
addx.l d6,d7 ;2
add.w d5,a3 ;2
add.w d7,a4 ;2
subq.l #1,a1 ;2
bne.b .smcloop ;2
56 cycles / 8 pixels = 7 cycles per pixel. (Gloom's innerloop is 18 cycles per pixel.)
The Alien Breed loop would look similar.
.
I think most doomclones on the Amiga can be doubled in speed with some work... (on a low-end Amiga)
|
sp_
Member |
;Fast wall-render innerloop for Amiga 500/A1200 to 12-bit copperchunky
;The loop plots 32 pixels: 4 pixels wide and 8 pixels high.
.loop
move.w (a0),(a4)
move.w (a1),4(a4)
move.w (a2),8(a4)
move.w (a3),12(a4)
add.l a5,a4
REPT 7
move.w 0000(a0),(a4)
move.w 0000(a1),4(a4)
move.w 0000(a2),8(a4)
move.w 0000(a3),12(a4)
add.l a5,a4
ENDR
addx.l d4,d5
addx.l d5,d6
addx.l d0,d1
addx.l d2,d3
add.w d5,a0
add.w d6,a1
add.w d1,a2
add.w d3,a3
subq.l #1,d7 ;loop counter in a data register: subq on An does not set the flags
bne.b .loop
MC68000:
632 cycles / 32 pixels = 19.75 cycles per pixel
MC68020:
228 cycles / 32 pixels = 7.125 cycles per pixel
|
sp_
Member |
On 020+, 6 adds can be removed with the loop below, resulting in 7.0625 cycles per pixel:
.loop
move.w (a0),(a4)
move.w (a1),4(a4)
move.w (a2),8(a4)
move.w (a3),12(a4)
move.w 0000(a0),width(a4)
move.w 0000(a1),width+4(a4)
move.w 0000(a2),width+8(a4)
move.w 0000(a3),width+12(a4)
move.w 0000(a0),width*2(a4)
move.w 0000(a1),width*2+4(a4)
move.w 0000(a2),width*2+8(a4)
move.w 0000(a3),width*2+12(a4)
...
add.l a5,a4
addx.l d4,d5
addx.l d5,d6
addx.l d0,d1
addx.l d2,d3
add.w d5,a0
add.w d6,a1
add.w d1,a2
add.w d3,a3
subq.l #1,d7 ;loop counter in a data register: subq on An does not set the flags
bne.b .loop
|
Azure
Member |
I think you are just developing the valuable insight that the people who spend time squeezing every last bit out of innerloops are usually not the same people who get things done :)
|
Azure
Member |
AFAIK Wolfenstein3d used SMC with an unrolled inner loop, very similar to what you are proposing.
|
sp_
Member |
When the doomclones started to appear on the Amiga in 1995 it was a revolution. Few people thought it was possible to do it on the standard 14 MHz 020. After 1995 most Amiga users moved to faster processors, and optimizing for the old hardware wasn't important anymore. Self-modifying code is not new. On the C64 I think most democoders use it, and on the Atari it's been used for a long time. The rotozoomer in Chaos Land by VD (1993) uses SMC, and probably some other old Amiga 500 intros/demos do too.
.
I understand the Alien Breed innerloop now. Textures are not 16 bit in memory, but 8 bit with a colormap. This doubles the number of textures that fit in memory, but slows the renderer by one memory move per pixel. With only 2 MB chip to play with, this was a fast way to get more textures. The shadetable is 4096*2 bytes long, probably with 16 or 8 shades, so up to 4096*2*16 bytes in memory.
.loop
move.b (a4,d1.w),d3 ;Read texture
blt.b .skip
move.w (a5,d3.w*2),d3 ;convert to 12 bit
move.w (a2,d3.w*2),(a6) ;shadetable and plot
.skip
add.w #$1a0,a6
addx.l d2,d1
dbf d4,.loop
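The same two-stage lookup written out in C, as a sketch only. The table sizes (a colormap with one entry per non-negative texel value and a 4096-entry shade table for one light level) and all names are my assumptions based on the reading above, not something taken from the game's source.

```c
#include <stddef.h>
#include <stdint.h>

/* tex:      8-bit texels, negative values are treated as transparent
   colormap: texel index -> 12-bit colour
   shade:    4096 entries, 12-bit colour -> shaded 12-bit colour */
void column_ab3d(const int8_t *tex, const uint16_t *colormap,
                 const uint16_t *shade, uint32_t pos, uint32_t step,
                 uint16_t *dst, size_t stride, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        int8_t texel = tex[pos >> 16];           /* move.b (a4,d1.w),d3    */
        if (texel >= 0) {                        /* blt.b .skip            */
            uint16_t rgb12 = colormap[texel];    /* convert to 12 bit      */
            *dst = shade[rgb12 & 0x0FFF];        /* shadetable and plot    */
        }
        dst += stride;                           /* add.w #$1a0,a6         */
        pos += step;                             /* addx.l d2,d1 (16.16 here) */
    }
}
```

Transparent texels leave the destination untouched, which is why the pointer and the texture position still advance after the skip.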
|
Crumb
Member |
|
sp_
Member |
I have downloaded the Alien Breed 3D II source code and it's a mess. I tried to find the innerloops by searching for the addx instruction, but ended up finding a horribly slow byte-move-to-chipmem c2p. The game is very nice in WinUAE and runs fullscreen at 50 fps, so why bother? :D
Most of the games you listed here are made for upgraded Amigas. I am searching for a fast doomclone on the A500 or standard A1200. When Doom was ported to the Amiga, I remember we had some discussion somewhere on how to improve the speed: innerloops and c2p. This took place on IRC or in some newsgroups, I don't remember. On 060 the original Doom is faster than all its clones (if I remember correctly).
This is the result of the power of opensource and skilled democoders with years of experience.
|
Crumb
Member |
Death Mask, Space Hulk and Ambermoon run on a standard A500. AFAIK Ambermoon was more or less optimized. Perhaps it would be worth a look.
Testament requires a standard A1200.
But you are 100% right with Breathless, NemacIV and Genetic Species as these games will require an accelerator to work smoothly.
BTW, hats off for your work optimizing innerloops :-)
Is it possible to write self-modifying code that works correctly on 040/060? Have you tried to create self-modifying code for 040/060 that fits inside the cache and runs fast?
|
sp_
Member |
On 040/060, self-modifying code won't give a speedup. The CPU is able to execute instructions while writing to slow memory (pipelining). An MC68030 clocked at 50 MHz is able to do 6 cycles for free while writing a longword to fastmem.
.loop
move.l d1,(a0)+
add.l d1,d2
add.l d3,d4
add.l d5,d6
dbf d7,.loop
is as fast as
.loop
move.l d1,(a0)+
dbf d7,.loop
On 060, two instructions can be executed at the same time if they don't share registers. I don't remember how many cycles it's possible to get for free after a fastmem write, but I assume it's at least 6. This means 12 instructions for free if they are paired.
.loop
move.l d1,(a0)+
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
add.l d0,d0
add.l d1,d1
...
dbf d7,.loop
.
To optimize for the 040/060 it helps to improve cache hits. If the textures are flipped in memory (X and Y axes swapped), a vertical wall renderer is more likely to read cached texture data. Plotting more than one horizontal pixel per loop, as suggested in the optimized loops above, will also increase cache hits. Using mip-maps (smaller textures for smaller polygons) might help too.
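Swapping the axes can be done once at load time. A minimal C sketch, assuming 8-bit texels stored row by row; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* src is stored row by row (w texels per row, h rows).
   dst is stored column by column, so one wall column is h
   consecutive bytes and shares cachelines between reads. */
void transpose_texture(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            dst[x * h + y] = src[y * w + x];
}

/* Start of texture column x in the transposed layout. */
const uint8_t *texture_column(const uint8_t *transposed, size_t h, size_t x)
{
    return transposed + x * h;
}
```

With this layout a vertical wall renderer steps through memory one byte at a time instead of one whole row at a time, so several consecutive texels come out of the same 16-byte cacheline.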
The MC68060 doesn't implement the 64-bit forms of divs.l and muls.l (64/32 divide, 32x32 -> 64 multiply) in hardware, so these instructions have to be emulated. A word-size divs is 32 bit / 16 bit and should be enough for calculating interpolations. A muls (2 cycles) is much faster than a divs. If you divide by less than 256 you can use <<8 fixed point and muls.w: number/n = (1/n) * number.
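The reciprocal trick can be sketched in C like this. The table name and the helper functions are made up for the example, and the table stores 256/n rounded down, which is one possible choice.

```c
#include <stdint.h>

static uint16_t recip[256];   /* recip[n] = 256 / n, i.e. 1/n in 8.8 fixed point */

void build_recip_table(void)
{
    recip[0] = 0;             /* division by zero is left to the caller */
    for (int n = 1; n < 256; n++)
        recip[n] = (uint16_t)(256 / n);
}

/* number / n as one 16x16 -> 32 multiply (a muls.w) plus a scale by 1/256.
   The asm version would use a shift instead of the divide by 256. */
int32_t div_by_recip(int16_t number, uint8_t n)
{
    return ((int32_t)number * (int32_t)recip[n]) / 256;
}
```

With only 8 fractional bits the result is exact for power-of-two divisors and slightly low for others (255/255 comes out as 0, for example), so for interpolation deltas it may be worth keeping more fractional bits in the table.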
|
dalton
Member |
The main problem with SMC on cache-based CPUs is that the instruction cache isn't writeable. To change an instruction you'd have to write it to the data cache first, which will then update main memory (eventually), and then there's still no detection mechanism to mirror main memory into the instruction cache when main memory is updated. So basically you'd have to flush the instruction cache so that it's all reloaded. Lots of pain. No gain!
Much more interesting to exploit benefits of cache architecture than trying to make smc work!
|
Azure
Member |
But the 040 and 060 have a writeback cache. With good cache management you should not have a lot of waitstates after writing.
|
sp_
Member |
The 060 has a 16-byte cacheline.
I assume that if you render pixels horizontally, the copyback cache will collect 16 bytes before the data is pushed to memory. This means that you get a waitstate only once every 16 pixels (when the cacheline is moved to memory).
a1 must be aligned to a 16-byte boundary.
.loop16
move.b (a0,d0.w),(a1)+
swap d0
move.b (a0,d1.w),(a1)+
swap d1
move.b (a0,d2.w),(a1)+
swap d2
move.b (a0,d3.w),(a1)+
swap d3
move.b (a0,d4.w),(a1)+
swap d4
move.b (a0,d5.w),(a1)+
swap d5
move.b (a0,d6.w),(a1)+
swap d6
move.b (a0,d7.w),(a1)+
swap d7
move.b (a0,d0.w),(a1)+
move.b (a0,d1.w),(a1)+
move.b (a0,d2.w),(a1)+
move.b (a0,d3.w),(a1)+
move.b (a0,d4.w),(a1)+
move.b (a0,d5.w),(a1)+
move.b (a0,d6.w),(a1)+
move.b (a0,d7.w),(a1)+ ;The last byte of the cacheline is filled and 16 bytes are moved to memory
...
The interpolation code for 16 pixels is placed here. Probably 0 cycles, since the bus is busy writing to memory.
...
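For the alignment requirement, a tiny C sketch of how the start of a span could be rounded up to the next cacheline. The helper names are hypothetical; the 16-byte line size is the one assumed in this post.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 16u

/* Round an address up to the next 16-byte boundary. */
uintptr_t align_up_16(uintptr_t addr)
{
    return (addr + (CACHE_LINE - 1)) & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Pixels to plot one by one before the unrolled 16-pixel loop can start. */
size_t pixels_before_aligned(uintptr_t addr)
{
    return (size_t)(align_up_16(addr) - addr);
}
```

A span renderer would plot `pixels_before_aligned` pixels with a simple loop, run the 16-pixel loop on whole cachelines, and finish the tail with the simple loop again.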
|
sp_
Member |
I found a wolfenstein port for the Atari ST.
The author claims the game runs at 15 fps on an 8 MHz MC68000 with 2 MB RAM.
If any Atari coders could disassemble it and give me the innerloop, it would be interesting.
http://freenet-homepage.de/ray.tscc/wolf3d.htm
|
sp_
Member |
I am planning to patch AB3D II to use truecolor mode by modifying the shadetables.
Every pixel is rendered through a 8bit shadetable...
move.b (a0,d0.w),(a1)+ ....
How about expanding the shadetables to 24bit :D
To give the game a fresh look..
What do you think?
|
Blueberry
Member |
I think it is a waste of good processor cycles (and memory) to have a certain cache miss on every blend operation, when a full alpha blend can be computed in less time. :)
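For comparison, a computed blend of two 24-bit pixels takes only a few multiplies and masks. This is a generic sketch in C, not code from any of the games discussed; the function name and the 0..256 alpha range are choices made for the example.

```c
#include <stdint.h>

/* Blend two 0x00RRGGBB pixels. alpha is 0..256, where 256 means
   "all a" and 0 means "all b". Red and blue are processed together
   in one multiply, green in another. */
uint32_t blend_rgb24(uint32_t a, uint32_t b, uint32_t alpha)
{
    uint32_t inv = 256u - alpha;
    uint32_t rb = (((a & 0x00FF00FFu) * alpha +
                    (b & 0x00FF00FFu) * inv) >> 8) & 0x00FF00FFu;
    uint32_t g  = (((a & 0x0000FF00u) * alpha +
                    (b & 0x0000FF00u) * inv) >> 8) & 0x0000FF00u;
    return rb | g;
}
```

The intermediate sums stay within 32 bits because the two weights always add up to 256.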
|
|
|