|
Author |
Message |
d0DgE
Member |
Hi guys! It's this time of the year that I usually start looking again in my collection of primitives with the aim of optimising stuff. This time I thought about my pixel drawing routine. ATM I do this solely with the CPU, no Blitter usage involved and as "wasted years" (the twist-ribbon) showed, it barely ran on the A500. Those were 2x160 pixels to outline the ribbon which then was filled with the Blitter onto the screen. Of course in the Twist-routine I used an in-line modified version of the following code to avoid unnecessary subroutine branches. This is the 1-Bitplane version.... WORD equ 15 drawPlanarPixel: ; drawing plane => a0 ...could be a buffer, too ; X => d0 ; Y => d1 ; SCRAP => d2,d3 movem.l d0-d3/a0,-(a7)
; manage the Y position move.w d1,d2 ; copy y lsl.w #4,d1 ; multiply by 40 lsl.w #3,d2 add.w d1,d1 add.w d2,d1 add.w d1,a0 ; enter y pos first
; manage the X position move.w d0,d2 ; copy x lsr.w #3,d0 ; divide by 8 to get the hardposition and.w #$000f,d2 ; mask the 4 lower bits for 0-15 softposition btst #0,d0 ; is the hardposition an odd value ? beq.s .even ; nope ...skip the -1 action subq.w #1,d0 ; it's odd ...sub 1 to keep even steps (68000!) .even: sub.w #WORD,d2 neg.w d2 add.w d0,a0 ; move to X hardposition move.w (a0),d3 ; take current screendata word aligned bset d2,d3 ; set the softposition pixel move.w d3,(a0) ; put back the modified word
movem.l (a7)+,d0-d3/a0 rts
Please note, that it is done for convenience so that I just provide decimal X,Y coordinates and a buffer to write to and fire the thing to get my pixel. I'd very much would appreciate any speed-up/optimising tips on this one. I'm not exactly fluent in all the available instructions - especially regarding bitfields and such stuff - so there must be ways to do this operation more elegantly and efficiantly. Is it a good idea to let the Blitter do this work ? If yes, is there a guide or an example to peek into regarding pixel plotting with the Blitter ? So far the Line-Drawing Mode gave me a headache -_- Thx in advance
|
ZEROblue
Member |
By extending your bitplanes to f.ex 64 bytes wide for faster addressing you can do: moveq #-$80, d2 ror.b d0, d2 lsr.w #3, d0 lsl.w #6, d1 add.w d0, d1 or.b d2, (a0, d1.w)
|
dalton
Member |
If you're referring to screens 11-13 here at ADA, I'd suggest not plotting any pixels at all. Put any static gfx in odd bitplanes, and then put a triangle in the even. The triangle should be 1 pixel wide on row 1, and extend it's size by one pixel on each side downwards. Then you create a copper list that writes to even bitplane modulo on each scanline. Then simply write the modulo that corresponds to a certain row in the triangle to draw a horisontal line of desired width. Colors can of course also be set using the copper list.
|
d0DgE
Member |
No, dalton, the ribbon was just an example given on what I used the pixel routine. The colours were set using the copper ;) - it was a 1 bpl effect.
There are of course a lot more occations you can use a fast pixel routine in.
ZeroBlue:
Interesting proposition. I'll try some of this. Thanks :)
|
dalton
Member |
I suggest something like this for a one-bitplane plot. In principle it's the same as the one you did, only it uses more shortcuts for setting bits and addressing... ; d0/d1 = x/y ; a0 = bitplane pointer
asl.w #6, d1 ; assuming bitplane is 64 bytes wide lea (a0,d1.w), a0
move.w d0, d1 ; copy x moveq #%10000000, d2 ; this is the pixel =) and.w #7, d0 ; mask out bit position asr.w #3, d1 ; get byte offset lsr.b d0, d2 ; shift pixel on position or.b d2, (a0,d1.w) ; put in place
There is good tutorial here: http://www.modermodemet.se/dalton/tut/DOTS.TXT(it's in swedish, but code is still code I guess)
|
Vektor
Member |
Another option. Who calculates the cycles per pixel?
; a0 = table with pixels to be plotted ; a1 = pre calculated screen multiply table ; a2 = pre calculated x division table ; a3 = screenpointer
lea Position_table(pc),a0 lea Shift_Table_x(pc),a1 lea Mulu_Table_y(pc),a3 lea screen(pc),a2 moveq #0,d3 moveq #num_of_pixels-1,d7
.loop
move.w (a0)+,d0 move.w (a0)+,d1
add.w d1,d1 ;y=y*2 to get an index move.w (a2,d1.w),d1 ;screen multiplication tables with multiply value
move.b (a1,d0.w),d3 ;add x-word add.w d3,d1 ;add x word position
not.w d0 ;shift bit bset d0,(a3,d1.W) ;plot the pixel
dbf d7,.loop rts
|
Vektor
Member |
Found a typing error:
bset d0,(a3,d1.W) should be bset d0,(a2,d1.W)
|
d0DgE
Member |
Nice hints Dalton, ZeroBlue. I've got it implemented and adapted. Also thanks Vektor but I need a rather flexible on-the-fly multi-purpose plotting routy. Maybe because I've become a custom to higher language methods like drawCircle(); ;)
Edit:
... the Y multiplication table trick from Vektor is really neat :D
|
dalton
Member |
I see now that I posted basically the same routine as ZeroBlue, only his was better =) Should read more carefully I guess...
|
coyote
Member |
I'm sure you guys noticed that dalton & ZeroBlue wrote routines that won't work on 68000 because of odd address accesses. (probably doesn't matter anyway...)
|
britelite
Member |
Umm, I can't see any reading or writing .w or .l at odd addresses...
|
coyote
Member |
Yeah britelite. You are right. Sorry, I must have still been sleeping... Mea culpa. O:-} My apologies to dalton and ZEROBlue.
|
Vektor
Member |
@d0DgE: Correct this routine plots a "simple" predefined array now but it can be used icw eg Bresenham to create a quite fast drawcircle routine. If you're interested I must have it somewhere in my old amiga sourcecodes.
|
d0DgE
Member |
Britez0r is quite right. ZeroBlue & Dalton's approach workes fine on the 68000. The only downside in the long run is the "or" itself, which makes it useful for separate bitplane actions only.
@Vektor: of course I'm interested. Send it to "dodge[ät]rowdyclub[döt].de" whenever you like :)
|
Vektor
Member |
Found it, this was the main plotting algo. I justed checked the entire source code with UAE, with A500 speed it runs in about 1/5 of a frame with a 260 pix wide circle.
@doDgE: I will email you the entire sourcecode
* a0 = x * a1 = y * a2 = screen * d0 = radius
Draw_circle: moveq #3,d5 moveq #6,d6
moveq #0,d1 ;x=0 move.w d0,d2
subq.w #1,d2 ;d=r-1 .loop: tst.w d2 bpl.b .no_ydec
subq.w #1,d0 ;y=y-1
add.w d0,d2 ;d=d+y
.no_ydec: move.w a1,d3 ;y sub.w d0,d3 ;y-r
lsl.w d6,d3 ;(y-r)*schermbreedte lea (a2,d3.w),a3 ;screen pointer + y-offset
move.w a0,d3 ;x add.w d1,d3 ;x+int x
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w)
move.w a0,d3 sub.w d1,d3
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w)
move.w a1,d3 sub.w d1,d3 lsl.w d6,d3 lea (a2,d3.w),a3
move.w a0,d3 add.w d0,d3
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w)
move.w a0,d3 sub.w d0,d3
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w) move.w a1,d3 add.w d1,d3
lsl.w d6,d3 lea (a2,d3.w),a3
move.w a0,d3 add.w d0,d3
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w)
move.w a0,d3 sub.w d0,d3
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w)
move.w a1,d3 add.w d0,d3
lsl.w d6,d3 lea (a2,d3.w),a3
move.w a0,d3 add.w d1,d3
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w) move.w a0,d3 sub.w d1,d3
move.w d3,d4 lsr.w d5,d3 not.w d4 bset d4,(a3,d3.w) sub.w d1,d2
subq.w #2,d2 addq.w #1,d1
cmp.w d0,d1 bls.w .loop rts
|
z5_
Member |
go go go, dodge! :)
@Vektor: any interest in rejoining the amigascene and code some stuff again? would be cool!
|
d0DgE
Member |
...by now it finally occured to me that one can create a quite convenient MACRO for this pixel plotting code ... D'OH
well, you'll stop learning
|
Vektor
Member |
@z5_, If you have interesting idea's I'm always open to code / review some things but don't expect too much!
|
Azure
Member |
It has been a long time since I did this, but this looks awefully wasteful to me.
Is this routine supposed to be optimized for 68000 or 68060? I dug around my old backups and found a 3d dotrotator I coded once. I don't think I have ever used it anywhere. It uses a similar approach as the one Mr. Pet did in roots, but may be slightly more optimized.
The innerloop performs 3D rotation, transformation into the screen space (perspectve) and pixel plotting.
.bigloop
REPT 2 move.l (a3)+,d3 move.l (a3)+,d2 move.l (a0,d0.w*4),d3 ;a0-a2 precalculated tables with add.l (a1,d2.w*4),d3 ;M-entries. 512 longwords each add.l (a2,d5.w*4),d3 ;d0=00000000SyyyyyyySzzzzzzzSxxxxxxx move.l (a4,d3.w*4),d1 ;Perspective for x (SzzzzzzzSxxxxxxx) bfset (a6){d4:1} ;setpixel (a6=planepointer) lsr.l #8,d3 ;12 free cycles... swap d2 swap d0 add.l (a5,d3.w*4),d1 ;Perspective for y (SyyyyyyySzzzzzzz) ;d1=Dotadress (pixnr)+planeoffset for ;colors ;d1 highword=0 move.l (a3)+,d5 move.l (a0,d3.w*4),d3 ;a0-a2 precalculated tables with add.l (a1,d5.w*4),d3 ;M-entries. 512 longwords each add.l (a2,d2.w*4),d3 ;d0=00000000SyyyyyyySzzzzzzzSxxxxxxx move.l (a4,d3.w*4),d4 ;Perspective for x (SzzzzzzzSxxxxxxx) bfset (a6){d1:1} ;setpixel (a6=planepointer) lsr.l #8,d3 swap d5 add.l (a5,d3.w*4),d4 ;Perspective for y (SyyyyyyySzzzzzzz) ;d1=Dotadress (pixnr)+planeoffset for ;colors ;d1 highword=0 ENDR
|
Vektor
Member |
@Azure: my routine is 68000 based. I looked at yours and except for the perspective precalc with the z coordinates in the upper word and the bfset (030+?) the aproach is basically the same, precalc everything, use the coordinates as index (which can be done within the instruction on 020+) The only thing I don't get are your first (three) longword moves, the third overwrites the first?
|
Azure
Member |
...the first move should probably be to D0. I was not able to check whether the sourcecode was functional.
bfset is very neat, as it allows to avoid separate shifting to calculate the address offset. There is really just a single instrution responsible for the plotting in this routine, the remaining instructions are for 3d calculations.
|
Rebb
Member |
My version of the pixel plotter. Already got some good tips here (removing the mulu), but as this is my first plotter i guess there's still lot of room for improvement. plot: ;takes d0=color,d1=x,d2=y,a0=bplane
findy: ; multiply y with 40 to get add factor for bitplane
move.w d2,d3 lsl.w #4,d2 lsl.w #3,d3 add.w d2,d2 add.w d3,d2 add.w d2,a0 checkplane: btst.l #0,d0 ; testbit on colorvalue to get planes to plot beq plane2 jsr pixset
plane2: lea bplane,a0 ; bitplane address to a0 add.l d2,a0 ; start address for correct line add.l #10240,a0 ; address of plane btst.l #1,d0 beq plane3 jsr pixset plane3: lea bplane,a0 add.l d2,a0 ; start address for correct line add.l #20480,a0 btst.l #2,d0 beq plane4 jsr pixset
plane4:
lea bplane,a0 add.l d2,a0 add.l #30720,a0 btst.l #3,d0 beq plane5 jsr pixset
plane5:
lea bplane,a0 add.l d2,a0 ; start address for correct line add.l #40960,a0 btst.l #4,d0 beq out jsr pixset
out: rts
pixset: move.l d1,d4 ; copy x to d4 move.l d1,d5 ; and d5 move.l d1,d3 ; and d3 lsr.l #3,d3 ; divide with 8 to get number of byte add.l d3,a0 ; get to the byte we are changing
asl.l #3,d3 ; How many times did x fit in 8? cmp #0,d3 ; If zero, x is directly the bits to set beq nolla ; sub.l d3,d4 ; Substract multiply of 8 from original x move.l d4,d5 ; to get pixel number nolla: move.l #7,d6 ; substract 7 from pixel number sub d5,d6 ; to get right bit bset d6,(a0) ; set the "d6 th bit" on a0
rts edit: What is a good way to "time out" routines like this, when optimising?
|
pmc
Member |
Rebb: edit: What is a good way to "time out" routines like this, when optimising? Do you mean: what's a good way to see how long the routine takes to execute? If so, then seeing how many raster lines it takes will give a good indication. To do that, before your routine wait for a screen position and change the background colour. At the end of your routine, change the background colour to what it was before you changed it at the start of your routine. The number of coloured lines you can see is now the number of raster lines your routine took. This code will do that for you: .wt_line: cmp.b #160,$dff006 bne.s .wt_line move.w #$0fff,$dff180
<your routine here>
move.w #$0000,$dff180
|
Vektor
Member |
To time a routine the easiest way is just to write a color change (#0 or #$0fff) to the dff180. You will see how many raster lines your routine takes... (I see now PMC has already answered this one..)
Maybe this gives some ideas!
plot: ;takes d0=color,d1=x,d2=y,a0=bplane
lea bplane(pc),a0 lea screenpointers(pc),a1 lea y_mulitply(pc),a2 lea x_words(pc),a3
lea dot_to_plot_table(pc),a5
moveq #0,d0 moveq #0,d1 moveq #0,d2 moveq #0,d3 moveq #0,d4 moveq #0,d7
move.w (a5)+,d7 ;number of pixels to be plot
.loop movem.w (a5)+,d0-d2;
add.w d0,d0 ;(x2 ) add.w d0,d0 ;(twice x2 makes x4 to make an index) move.l (a1,d0),a0 ;add the right value to the bplane pointer
add.w d2,d2 ;y=y*2 to get an index move.w (a2,d2.w),d2 ;screen multiplication tables with multiply value
move.b (a3,d1.w),d3 ;add x-word add.w d3,d2 ;add x word position to the y position
not.w d0 ;shift to ensure the right bit is set bset d0,(a3,d2.W) ;plot the pixel
dbf d7,.loop
rts
screenpointers: dc.l bplane dc.l bplane+10240 dc.l bplane+2*10240 dc.l bplane+3*10240 dc.l bplane+4*10240
x_words: dc.b 0,0,0,0,0,0,0,0 dc.b 1,1,1,1,1,1,1,1 dc.b 2,2,2,2,2,2,2,2 dc.b 3,3,3,3,3,3,3,3 dc.b 4,4,4,4,4,4,4,4 dc.b etc
y_multiply: dc.w 0 dc.w screen_width ; in bytes dc.w 1*screen_width ; in bytes dc.w 2*screen_width ; in bytes dc.w 3*screen_width ; in bytes dc.w 4*screen_width ; in bytes dc.w etc
dot_to_plot_table: dc.w 4-1; number of pixels (minus 1) to be plot dc.w 0,0,200; plane,x,y dc.w 1, 200,200 dc.w 2, 200,0 dc.w 3,0,0
bplane: dcb.b 5*10240,0
|
Kalms
Member |
Rebb:
your routine will invoke "pixset" multiple times. Most of the code in "pixset" will give the exact same result every time you invoke it. Thus those calculations can be moved out of the "pixset" routine.
In order to get some simple metrics, consider these:
* how many instructions do you execute when plotting a pixel with color 1? * how many instructions do you execute when plotting a pixel with color 31?
Pick one or several metrics of the kind above, decide which are important to you, and try to improve those metrics.
|
ZEROblue
Member |
Make sure you have consistent DMA activity across the lines you are measuring over using the above method, or the result might be a completely wrong indication.
A high amount of DMA activity (many bitplanes, sprites, audio, blitter running etc.) can halt the CPU severely, and going from 200 to 100 colored lines doesn't necessarily mean your routine is now twice as fast, and so this may be a very inexact method.
However if you're just looking to see if your routine simply becomes faster or slower it will work fine. Typically you would then find f.ex how many dots you can plot in the context of your demo part and still maintain the same frame rate.
|
noname
Member |
I would generally try to avoid the use of a setPixel function by all means. In this respect, Azure's post has leading for me. Also, macros might come in handy to inline frequently used subroutines.
|
d0DgE
Member |
Exactly. The massive amounts of subroutine branches (bsr setPixel) during a drawCircle for example really slowed down my first circle draw routines. JSR is even slower. So building a tiny MACRO with the very essential lines of setting a bit at an X | Y position is a really neat thing to implement. @Rebb: By scanning through your code example it occured to me, that you only ask for planes to be drawn into, not those where you might clear a bit in order to set the right colour ( 0 - 31). That could result in less available colours or even distort the complete screen result. As you were showing your >1 Bitplane approach I can still give my version for a 5-Bitplane pixeldraw. Please note: this is still the totally bloated- slow as hell version with no improvements implemented that this nice thread offered. I used this thing to pre-render the pixelplasma-animation shown in Wasted Years' end screen and it is the very reason I had to build a "werkkzeug" loaderbar :/ This routine is very convenient when it comes to the colour values. You just drop f.e. "17" to _drpColour, give the coordinates and screen and fire the damn thing, but it is in no way "real-time" fit word equ 15 _drpPlaneSize: dc.l plsize cnop 0,4 _drpColours: dc.b 0 ; 180 : 00 dc.b %00000001 ; 182 : 01 dc.b %00000010 ; 184 : 02 dc.b %00000011 ; 186 : 03 dc.b %00000100 ; 188 : 04 dc.b %00000101 ; 18a : 05 dc.b %00000110 ; 18c : 06 dc.b %00000111 ; 18e : 07 dc.b %00001000 ; 190 : 08 dc.b %00001001 ; 192 : 09 dc.b %00001010 ; 194 : 10 dc.b %00001011 ; 196 : 11 dc.b %00001100 ; 198 : 12 dc.b %00001101 ; 19a : 13 dc.b %00001110 ; 19c : 14 dc.b %00001111 ; 19e : 15 dc.b %00010000 ; 1a0 : 16 dc.b %00010001 ; 1a2 : 17 dc.b %00010010 ; 1a4 : 18 dc.b %00010011 ; 1a6 : 19 dc.b %00010100 ; 1a8 : 20 dc.b %00010101 ; 1aa : 21 dc.b %00010110 ; 1ac : 22 dc.b %00010111 ; 1ae : 23 dc.b %00011000 ; 1b0 : 24 dc.b %00011001 ; 1b2 : 25 dc.b %00011010 ; 1b4 : 26 dc.b %00011011 ; 1b6 : 27 dc.b %00011100 ; 1b8 : 28 dc.b %00011101 ; 1ba : 29 dc.b %00011110 ; 1bc : 30 dc.b %00011111 ; 1be : 31 _drpColour: dc.b 0 cnop 0,4 drawRealPixel: ; drawing plane => a0 ; X => d0 ; Y => d1 ; colour offset => d3 ; SCRAP => d2-d5 movem.l d0-d7/a0/a1,-(a7) lea _drpColours(pc),a1 move.l _drpPlaneSize(pc),d6
moveq #0,d4 tst.b d3 ; is there a colour? bne.s .ok movem.l (a7)+,d0-d7/a0/a1 rts .ok: move.b (a1,d3),d4 ; colour byte moveq #0,d5 ; first bit in colour byte ; move to position move.w d1,d2 ; copy y lsl.w #4,d1 ; multiply by 40 lsl.w #3,d2 add.w d1,d1 add.w d2,d1 add.w d1,a0 ; y pos first ; manage the X position move.w d0,d2 ; copy x lsr.w #3,d0 ; divide by 8 to get the hardposition and.w #$000f,d2 ; mask the 4 lower bits for 0-15 softposition btst #0,d0 ; is the hardposition an odd value ? beq.s .even ; nope ...skip the -1 action subq.w #1,d0 ; it's odd ...sub 1 to keep even steps (A500!) .even: sub.w #word,d2 neg.w d2 add.w d0,a0 ; move to hardposition
moveq #planes-1,d7 .drawlp: move.w (a0),d3 ; take current screendata word sized btst d5,d4 ; bit 0 or 1 (i.e. clear or draw) beq.s .clear bset d2,d3 ; colour bit is lit -> set it lit in the data bra.s .cont .clear: bclr d2,d3 ; colour bit is clr -> clear it in the data .cont: move.w d3,(a0) ; write back the modified data addq.b #1,d5 ; prepare next colour bit to test add.l d6,a0 ; jump to next plane dbf d7,.drawlp .end: movem.l (a7)+,d0-d7/a0/a1 rts
|
sp_
Member |
Azure's example replaces a matrix multiplication and perspective transformation with a set of small lookuptables.
9 multiplications and 2 divisions per pixel removed with Dynamic programming.
In CodeTherory Matrix approximations have proven not to give the optimal codes. Dynamic programming might...
There is a faster way to solve this problem on the a500. If I ever finnish my a500 demo I will show you. ;)
|
Azure
Member |
sp:
On A500 you can simply hardcore all offsets to the lookup tables and completely unroll the loop. Graham has done something like this on C64 long long ago...
|
|
|