|  | 
| Author | Message |  
	| d0DgE Member
 | 
		Hi guys! It's this time of the year that I usually start looking through my collection of primitives again with the aim of optimising stuff. This time I thought about my pixel drawing routine. ATM I do this solely with the CPU, no Blitter usage involved, and as "wasted years" (the twist-ribbon) showed, it barely ran on the A500. Those were 2x160 pixels to outline the ribbon, which was then filled with the Blitter onto the screen. Of course in the Twist-routine I used an in-line modified version of the following code to avoid unnecessary subroutine branches. This is the 1-Bitplane version....
 WORD equ 15
 drawPlanarPixel:
 ; drawing plane => a0 ...could be a buffer, too
 ; X 		=> d0
 ; Y 		=> d1
 ; SCRAP 	=> d2,d3
 movem.l	d0-d3/a0,-(a7)
 
 ; manage the Y position
 move.w	d1,d2		; copy y
 lsl.w	#4,d1		; multiply by 40
 lsl.w	#3,d2
 add.w	d1,d1
 add.w	d2,d1
 add.w	d1,a0		; enter y pos first
 
 ; manage the X position
 move.w	d0,d2		; copy x
 lsr.w	#3,d0		; divide by 8 to get the hardposition
 and.w	#$000f,d2	; mask the 4 lower bits for 0-15 softposition
 btst	#0,d0		; is the hardposition an odd value ?
 beq.s	.even		; nope ...skip the -1 action
 subq.w	#1,d0		; it's odd ...sub 1 to keep even steps (68000!)
 .even:
 sub.w	#WORD,d2
 neg.w	d2
 add.w	d0,a0		; move to X hardposition
 move.w	(a0),d3		; take current screendata word aligned
 bset	d2,d3		; set the softposition pixel
 move.w	d3,(a0)		; put back the modified word
 
 movem.l	(a7)+,d0-d3/a0
 rts
 
Please note that it is done for convenience, so that I just provide decimal X,Y coordinates and a buffer to write to and fire the thing to get my pixel. I would very much appreciate any speed-up/optimising tips on this one. I'm not exactly fluent in all the available instructions - especially regarding bitfields and such stuff - so there must be ways to do this operation more elegantly and efficiently. Is it a good idea to let the Blitter do this work? If yes, is there a guide or an example to peek into regarding pixel plotting with the Blitter? So far the Line-Drawing Mode gave me a headache -_- Thx in advance		 |  
	| ZEROblue Member
 | 
		By extending your bitplanes to f.ex 64 bytes wide for faster addressing you can do:
 moveq  #-$80, d2
 ror.b  d0, d2
 lsr.w  #3, d0
 lsl.w  #6, d1
 add.w  d0, d1
 or.b   d2, (a0, d1.w)
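 A wider buffer also needs a matching bitplane modulo, otherwise the display runs out of step with the 64-byte lines. A minimal sketch, assuming a standard 320 pixel lores display that fetches 40 bytes per line; in a real demo these would more likely be copper list entries than CPU writes:

```asm
	move.w	#64-40,$dff108	; BPL1MOD: skip 24 bytes after each line (odd planes)
	move.w	#64-40,$dff10a	; BPL2MOD: same for the even planes
```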
 
 |  
	| dalton Member
 | 
		If you're referring to screens 11-13 here at ADA, I'd suggest not plotting any pixels at all. Put any static gfx in odd bitplanes, and then put a triangle in the even. The triangle should be 1 pixel wide on row 1, and extend its size by one pixel on each side downwards. Then you create a copper list that writes to even bitplane modulo on each scanline. Then simply write the modulo that corresponds to a certain row in the triangle to draw a horizontal line of desired width. Colors can of course also be set using the copper list.		 |  
	| d0DgE Member
 | 
		No, dalton, the ribbon was just an example of what I used the pixel routine for. The colours were set using the copper ;) - it was a 1 bpl effect.
 
 There are of course a lot more occasions you can use a fast pixel routine in.
 
 ZeroBlue:
 
 Interesting proposition. I'll try some of this. Thanks :)
 |  
	| dalton Member
 | 
		I suggest something like this for a one-bitplane plot. In principle it's the same as the one you did, only it uses more shortcuts for setting bits and addressing...
 ; d0/d1 = x/y
 ; a0 = bitplane pointer
 
 asl.w  #6, d1         ; assuming bitplane is 64 bytes wide
 lea    (a0,d1.w), a0
 
 move.w d0, d1     ; copy x
 moveq  #-128, d2     ; this is the pixel =) (%10000000 in the low byte; moveq takes a signed 8-bit value)
 and.w  #7, d0    ; mask out bit position
 asr.w  #3, d1     ; get byte offset
 lsr.b  d0, d2     ; shift pixel on position
 or.b   d2, (a0,d1.w) ; put in place
 
There is a good tutorial here: http://www.modermodemet.se/dalton/tut/DOTS.TXT (it's in Swedish, but code is still code I guess)		 |  
	| Vektor Member
 | 
		Another option. Who calculates the cycles per pixel?
 ; a0 = table with pixels to be plotted
 ; a1 = pre calculated x division table
 ; a3 = pre calculated screen multiply table
 ; a2 = screenpointer
 
 lea Position_table(pc),a0
 lea Shift_Table_x(pc),a1
 lea Mulu_Table_y(pc),a3
 lea screen(pc),a2
 moveq #0,d3
 moveq #num_of_pixels-1,d7
 
 .loop
 
 move.w (a0)+,d0
 move.w (a0)+,d1
 
 add.w  d1,d1	;y=y*2 to get an index
 move.w (a3,d1.w),d1	;screen multiplication table (Mulu_Table_y is in a3)
 
 move.b (a1,d0.w),d3	;add x-word
 add.w d3,d1 	;add x word position
 
 not.w d0 		;shift bit
 bset d0,(a3,d1.W) 	;plot the pixel
 
 dbf d7,.loop
 rts
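 The two lookup tables used above have to be filled once at init time. A minimal sketch of how that could look, assuming a 40 bytes wide, 320x256 bitplane; InitTables and the three constants are hypothetical names, while Mulu_Table_y and Shift_Table_x are the labels from the routine above:

```asm
SCREEN_W_BYTES	equ	40	; assumed: bytes per bitplane line
NUM_LINES	equ	256	; assumed: PAL lores height
NUM_COLUMNS	equ	320	; assumed: lores width in pixels

InitTables:
	lea	Mulu_Table_y,a0
	moveq	#0,d0
	move.w	#NUM_LINES-1,d7
.ytab:
	move.w	d0,(a0)+		; entry y = y*SCREEN_W_BYTES
	add.w	#SCREEN_W_BYTES,d0
	dbf	d7,.ytab

	lea	Shift_Table_x,a0
	moveq	#0,d0
	move.w	#NUM_COLUMNS-1,d7
.xtab:
	move.w	d0,d1
	lsr.w	#3,d1			; x/8 = byte offset within the line
	move.b	d1,(a0)+
	addq.w	#1,d0
	dbf	d7,.xtab
	rts

Mulu_Table_y:	ds.w	NUM_LINES	; word entries, hence the y*2 index
Shift_Table_x:	ds.b	NUM_COLUMNS	; byte entries, indexed by x
	even
```

 The word table is why the plot loop doubles y before the lookup. The byte table is indexed by x directly.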
 |  
	| Vektor Member
 | 
		Found a typing error:
 bset d0,(a3,d1.W) should be  bset d0,(a2,d1.W)
 |  
	| d0DgE Member
 | 
		Nice hints Dalton, ZeroBlue. I've got it implemented and adapted. Also thanks Vektor but I need a rather flexible on-the-fly multi-purpose plotting routy.
 Maybe because I've become accustomed to higher language methods like drawCircle();  ;)
 
 Edit:
 
 ... the Y multiplication table trick from Vektor is really neat :D
 |  
	| dalton Member
 | 
		I see now that I posted basically the same routine as ZeroBlue, only his was better =) Should read more carefully I guess...		 |  
	| coyote Member
 | 
		I'm sure you guys noticed that dalton & ZeroBlue wrote routines that won't work on 68000 because of odd address accesses. (probably doesn't matter anyway...)		 |  
	| britelite Member
 | 
		Umm, I can't see any reading or writing .w or .l at odd addresses...		 |  
	| coyote Member
 | 
		Yeah britelite. You are right. Sorry, I must have still been sleeping...
 Mea culpa.  O:-}
 My apologies to dalton and ZEROBlue.
 |  
	| Vektor Member
 | 
		@d0DgE: Correct, this routine plots a "simple" predefined array now but it can be used in combination with e.g. Bresenham to create a quite fast drawcircle routine. If you're interested I must have it somewhere in my old amiga sourcecodes.		 |  
	| d0DgE Member
 | 
		Britez0r is quite right. ZeroBlue & Dalton's approach works fine on the 68000.
 The only downside in the long run is the "or" itself, which makes it useful for
 separate bitplane actions only.
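 For completeness, the same addressing can also clear a pixel by inverting the mask and using "and" instead of "or". A minimal sketch, assuming the 64 bytes wide bitplane and the register usage of ZEROblue's routine (d0 = x, d1 = y, a0 = bitplane, d2 = scratch):

```asm
	moveq	#-$80,d2	; %10000000 in the low byte
	ror.b	d0,d2		; rotate the bit to position x&7
	not.b	d2		; invert: every bit set except the pixel
	lsr.w	#3,d0		; x/8 = byte offset within the line
	lsl.w	#6,d1		; y*64
	add.w	d0,d1
	and.b	d2,(a0,d1.w)	; clear the pixel, leave its neighbours alone
```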
 
 @Vektor: of course I'm interested. Send it to "dodge[ät]rowdyclub[döt].de" whenever you like :)
 |  
	| Vektor Member
 | 
		Found it, this was the main plotting algo. I just checked the entire source code with UAE, with A500 speed it runs in about 1/5 of a frame with a 260 pix wide circle.
 @doDgE: I will email you the entire sourcecode
 
 * a0 = x
 * a1 = y
 * a2 = screen
 * d0 = radius
 
 Draw_circle:
 moveq	#3,d5
 moveq	#6,d6
 
 moveq	#0,d1			;x=0
 move.w	d0,d2
 
 subq.w	#1,d2			;d=r-1
 .loop:
 tst.w	d2
 bpl.b	.no_ydec
 
 subq.w	#1,d0			;y=y-1
 
 add.w	d0,d2			;d=d+y
 
 .no_ydec:
 move.w	a1,d3			;y
 sub.w	d0,d3			;y-r
 
 lsl.w	d6,d3			;(y-r)*screen width
 lea	(a2,d3.w),a3		;screen pointer + y-offset
 
 move.w	a0,d3			;x
 add.w	d1,d3			;x+int x
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 move.w	a0,d3
 sub.w	d1,d3
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 move.w	a1,d3
 sub.w	d1,d3
 
 lsl.w	d6,d3
 lea	(a2,d3.w),a3
 
 move.w	a0,d3
 add.w	d0,d3
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 move.w	a0,d3
 sub.w	d0,d3
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 move.w	a1,d3
 add.w	d1,d3
 
 lsl.w	d6,d3
 lea	(a2,d3.w),a3
 
 move.w	a0,d3
 add.w	d0,d3
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 move.w	a0,d3
 sub.w	d0,d3
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 move.w	a1,d3
 add.w	d0,d3
 
 lsl.w	d6,d3
 lea	(a2,d3.w),a3
 
 move.w	a0,d3
 add.w	d1,d3
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 move.w	a0,d3
 sub.w	d1,d3
 
 move.w	d3,d4
 lsr.w	d5,d3
 not.w	d4
 bset	d4,(a3,d3.w)
 
 sub.w	d1,d2
 
 subq.w	#2,d2
 addq.w	#1,d1
 
 cmp.w	d0,d1
 bls.w	.loop
 rts
 |  
	| z5_ Member
 | 
		go go go, dodge! :)
 @Vektor: any interest in rejoining the amigascene and code some stuff again? would be cool!
 |  
	| d0DgE Member
 | 
		...by now it finally occurred to me that one can create a quite convenient MACRO for this pixel plotting code ... D'OH
 well, you never stop learning
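 Such a macro could look roughly like this. It is only a sketch that wraps ZEROblue's routine from above, so it assumes a 64 bytes wide bitplane; the name PLOT and the parameter order are made up for illustration:

```asm
; PLOT x,y,scratch,plane   (all four arguments are register names)
; trashes the x, y and scratch registers
PLOT	MACRO
	moveq	#-$80,\3	; %10000000 in the low byte
	ror.b	\1,\3		; rotate the bit to position x&7
	lsr.w	#3,\1		; x/8
	lsl.w	#6,\2		; y*64
	add.w	\1,\2
	or.b	\3,(\4,\2.w)	; set the pixel
	ENDM

	; usage: x in d0, y in d1, bitplane in a0
	PLOT	d0,d1,d2,a0
```

 Each use expands the six instructions in place, so the bsr/rts pair is gone, at the price of larger code.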
 |  
	| Vektor Member
 | 
		@z5_, If you have interesting ideas I'm always open to code / review some things but don't expect too much!		 |  
	| Azure Member
 | 
		It has been a long time since I did this, but this looks awfully wasteful to me.
 Is this routine supposed to be optimized for 68000 or 68060? I dug around my old backups and found a 3d dotrotator I coded once. I don't think I have ever used it anywhere. It uses a similar approach as the one Mr. Pet did in roots, but may be slightly more optimized.
 
 The innerloop performs 3D rotation, transformation into the screen space (perspective) and pixel plotting.
 
 
 .bigloop
 
 REPT	2
 move.l	(a3)+,d3
 move.l	(a3)+,d2
 
 move.l  (a0,d0.w*4),d3  ;a0-a2 precalculated tables with
 add.l   (a1,d2.w*4),d3  ;M-entries. 512 longwords each
 add.l   (a2,d5.w*4),d3
 ;d0=00000000SyyyyyyySzzzzzzzSxxxxxxx
 move.l  (a4,d3.w*4),d1  ;Perspective for x (SzzzzzzzSxxxxxxx)
 bfset   (a6){d4:1}      ;setpixel (a6=planepointer)
 lsr.l   #8,d3		;12 free cycles...
 swap	d2
 swap	d0
 add.l   (a5,d3.w*4),d1  ;Perspective for y (SyyyyyyySzzzzzzz)
 ;d1=Dotadress (pixnr)+planeoffset for
 ;colors
 ;d1 highword=0
 
 move.l	(a3)+,d5
 move.l  (a0,d3.w*4),d3  ;a0-a2 precalculated tables with
 add.l   (a1,d5.w*4),d3  ;M-entries. 512 longwords each
 add.l   (a2,d2.w*4),d3
 ;d0=00000000SyyyyyyySzzzzzzzSxxxxxxx
 move.l  (a4,d3.w*4),d4  ;Perspective for x (SzzzzzzzSxxxxxxx)
 bfset   (a6){d1:1}      ;setpixel (a6=planepointer)
 lsr.l   #8,d3
 swap	d5
 add.l   (a5,d3.w*4),d4  ;Perspective for y (SyyyyyyySzzzzzzz)
 ;d1=Dotadress (pixnr)+planeoffset for
 ;colors
 ;d1 highword=0
 ENDR
 |  
	| Vektor Member
 | 
		@Azure: my routine is 68000 based. I looked at yours and except for the perspective precalc with the z coordinates in the upper word and the bfset (030+?) the approach is basically the same: precalc everything, use the coordinates as index (which can be done within the instruction on 020+). The only thing I don't get are your first (three) longword moves, the third overwrites the first?
 |  
	| Azure Member
 | 
		...the first move should probably be to D0. I was not able to check whether the sourcecode was functional.
 bfset is very neat, as it allows to avoid separate shifting to calculate the address offset. There is really just a single instruction responsible for the plotting in this routine, the remaining instructions are for 3d calculations.
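 As a stand-alone illustration of that idea: bfset takes a bit offset counted from the most significant bit of the byte at the given address, so the pixel number within the plane can be used directly. A minimal sketch for 68020 and up, assuming a 320 pixels (40 bytes) wide bitplane and plain x/y input instead of the lookup tables used above:

```asm
; d0 = x, d1 = y (both unsigned words), a0 = bitplane
	mulu.w	#320,d1		; d1.l = pixel number of the start of line y
	ext.l	d0
	add.l	d0,d1		; + x = pixel number within the plane
	bfset	(a0){d1:1}	; set that single bit
```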
 |  
	| Rebb Member
 | 
		My version of the pixel plotter. Already got some good tips here (removing the mulu), but as this is my first plotter I guess there's still a lot of room for improvement.
 plot:	;takes d0=color,d1=x,d2=y,a0=bplane
 
 
 findy:
 ; multiply y with 40 to get add factor for bitplane
 
 move.w d2,d3
 lsl.w	#4,d2
 lsl.w   #3,d3
 add.w   d2,d2
 add.w   d3,d2
 add.w	d2,a0
 
 checkplane:
 btst.l	#0,d0	; testbit on colorvalue to get planes to plot
 beq	plane2
 jsr	pixset
 
 plane2:
 lea	bplane,a0 ; bitplane address to a0
 add.l	d2,a0   ; start address for correct line
 add.l	#10240,a0 ; address of plane
 btst.l  #1,d0
 beq 	plane3
 jsr	pixset
 
 plane3:
 lea	bplane,a0
 add.l	d2,a0   ; start address for correct line
 add.l	#20480,a0
 btst.l	#2,d0
 beq	plane4
 jsr	pixset
 
 plane4:
 
 lea	bplane,a0
 add.l	d2,a0
 add.l	#30720,a0
 btst.l  #3,d0
 beq     plane5
 jsr	pixset
 
 
 
 plane5:
 
 lea	bplane,a0
 add.l	d2,a0   ; start address for correct line
 add.l	#40960,a0
 btst.l	#4,d0
 beq 	out
 jsr	pixset
 
 
 out:
 rts
 
 pixset:
 move.l	d1,d4		; copy x to d4
 move.l	d1,d5		; and d5
 move.l	d1,d3		; and d3
 lsr.l	#3,d3		; divide with 8 to get number of byte
 add.l	d3,a0		; get to the byte we are changing
 
 asl.l	#3,d3		; How many times did x fit in 8?
 cmp	#0,d3		; If zero, x is directly the bits to set
 beq	nolla		;
 sub.l   d3,d4		; Subtract multiple of 8 from original x
 move.l	d4,d5		; to get pixel number
 nolla:
 
 move.l  #7,d6		; subtract pixel number from 7
 sub	d5,d6		; to get right bit
 bset	d6,(a0)		; set the "d6 th bit" on a0
 
 
 rts
edit: What is a good way to "time out" routines like this, when optimising?		 |  
	| pmc Member
 | 
		Rebb: edit: What is a good way to "time out" routines like this, when optimising?
 Do you mean: what's a good way to see how long the routine takes to execute? If so, then seeing how many raster lines it takes will give a good indication. To do that, before your routine wait for a screen position and change the background colour. At the end of your routine, change the background colour back to what it was. The number of coloured lines you can see is now the number of raster lines your routine took. This code will do that for you:
 .wt_line:	cmp.b	#160,$dff006
 bne.s	.wt_line
 move.w	#$0fff,$dff180
 
 <your routine here>
 
 move.w	#$0000,$dff180
 |  
	| Vektor Member
 | 
		To time a routine the easiest way is just to write a color change (#0 or #$0fff) to the dff180. You will see how many raster lines your routine takes... (I see now PMC has already answered this one..)
 Maybe this gives some ideas!
 
 
 
 plot:	;takes its input from dot_to_plot_table (plane,x,y per pixel)
 
 lea bplane(pc),a0
 lea screenpointers(pc),a1
 lea y_multiply(pc),a2
 lea x_words(pc),a3
 
 lea dot_to_plot_table(pc),a5
 
 moveq #0,d0
 moveq #0,d1
 moveq #0,d2
 moveq #0,d3
 moveq #0,d4
 moveq #0,d7
 
 move.w (a5)+,d7 ;number of pixels to be plot
 
 .loop
 movem.w (a5)+,d0-d2;
 
 add.w d0,d0 ;(x2 )
 add.w d0,d0 ;(twice x2 makes x4 to make an index)
 move.l (a1,d0.w),a0                   ;fetch the pointer to the selected bitplane
 
 add.w d2,d2 ;y=y*2 to get an index
 move.w (a2,d2.w),d2 ;screen multiplication tables with multiply value
 
 move.b (a3,d1.w),d3 ;add x-word
 add.w d3,d2 ;add x word position to the y position
 
 not.w d1 ;invert x: bset on memory uses the bit number modulo 8, giving 7-(x&7)
 bset d1,(a0,d2.w) ;plot the pixel in the selected plane
 
 dbf d7,.loop
 
 rts
 
 screenpointers:
 dc.l bplane
 dc.l bplane+10240
 dc.l bplane+2*10240
 dc.l bplane+3*10240
 dc.l bplane+4*10240
 
 x_words:
 dc.b 0,0,0,0,0,0,0,0
 dc.b 1,1,1,1,1,1,1,1
 dc.b 2,2,2,2,2,2,2,2
 dc.b 3,3,3,3,3,3,3,3
 dc.b 4,4,4,4,4,4,4,4
 dc.b etc
 
 y_multiply:
 dc.w 0 ; row 0
 dc.w 1*screen_width ; in bytes
 dc.w 2*screen_width ; in bytes
 dc.w 3*screen_width ; in bytes
 dc.w 4*screen_width ; in bytes
 dc.w etc
 
 dot_to_plot_table:
 dc.w 4-1; number of pixels (minus 1) to be plot
 dc.w 0,0,200; plane,x,y
 dc.w 1, 200,200
 dc.w 2, 200,0
 dc.w 3,0,0
 
 bplane:
 dcb.b 5*10240,0
 |  
	| Kalms Member
 | 
		Rebb:
 your routine will invoke "pixset" multiple times. Most of the code in "pixset" will give the exact same result every time you invoke it. Thus those calculations can be moved out of the "pixset" routine.
 
 In order to get some simple metrics, consider these:
 
 * how many instructions do you execute when plotting a pixel with color 1?
 * how many instructions do you execute when plotting a pixel with color 31?
 
 Pick one or several metrics of the kind above, decide which are important to you, and try to improve those metrics.
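 
 To illustrate the point about moving the shared calculations out: the byte offset and the bit number are the same for every plane, so they can be computed once and only the set/clear decision stays in the loop. A rough sketch, assuming Rebb's layout (5 planes of 10240 bytes each, 40 bytes per line, a0 pointing at the first plane); PlotColour is a hypothetical name:

```asm
; d0 = colour (0-31), d1 = x, d2 = y, a0 = first bitplane
; trashes d0-d3/d7/a0
PlotColour:
	move.w	d2,d3
	lsl.w	#5,d2		; y*32
	lsl.w	#3,d3		; y*8
	add.w	d3,d2		; y*40
	move.w	d1,d3
	lsr.w	#3,d3		; x/8
	add.w	d3,d2		; byte offset within a plane
	lea	(a0,d2.w),a0	; address of the byte in plane 1
	not.w	d1		; bit number: memory bset/bclr use it modulo 8
	moveq	#5-1,d7		; 5 planes
.plane:
	lsr.w	#1,d0		; next colour bit into the carry flag
	bcs.s	.set
	bclr	d1,(a0)		; colour bit is 0: clear the pixel in this plane
	bra.s	.next
.set:
	bset	d1,(a0)		; colour bit is 1: set the pixel in this plane
.next:
	lea	10240(a0),a0	; same byte, next plane
	dbf	d7,.plane
	rts
```

 Because every plane is either set or cleared, whatever colour was there before is fully overwritten, and the per-pixel instruction count no longer depends on the colour value.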
 |  
	| ZEROblue Member
 | 
		Make sure you have consistent DMA activity across the lines you are measuring over using the above method, or the result might be a completely wrong indication.
 A high amount of DMA activity (many bitplanes, sprites, audio, blitter running etc.) can halt the CPU severely, and going from 200 to 100 colored lines doesn't necessarily mean your routine is now twice as fast, and so this may be a very inexact method.
 
 However if you're just looking to see if your routine simply becomes faster or slower it will work fine. Typically you would then find f.ex how many dots you can plot in the context of your demo part and still maintain the same frame rate.
 |  
	| noname Member
 | 
		I would generally try to avoid the use of a setPixel function by all means. In this respect, Azure's post leads the way for me. Also, macros might come in handy to inline frequently used subroutines.		 |  
	| d0DgE Member
 | 
		Exactly. The massive amounts of subroutine branches (bsr setPixel) during a drawCircle for example really slowed down my first circle draw routines. JSR is even slower. So building a tiny MACRO with the very essential lines of setting a bit at an  X | Y position is a really neat thing to implement.
 
 @Rebb: By scanning through your code example it occurred to me that you only ask for planes to be drawn into, not those where you might clear a bit in order to set the right colour (0 - 31). That could result in fewer available colours or even distort the complete screen result.
 
 As you were showing your >1 Bitplane approach I can still give my version for a 5-Bitplane pixeldraw. Please note: this is still the totally bloated, slow as hell version with none of the improvements implemented that this nice thread offered. I used this thing to pre-render the pixelplasma-animation shown in Wasted Years' end screen and it is the very reason I had to build a "werkkzeug" loaderbar :/
 
 This routine is very convenient when it comes to the colour values. You just drop f.e. "17" to _drpColour, give the coordinates and screen and fire the damn thing, but it is in no way "real-time" fit.
 word	equ	15
 _drpPlaneSize:	dc.l	plsize
 cnop	0,4
 _drpColours:
 dc.b	0		; 180 : 00
 dc.b	%00000001	; 182 : 01
 dc.b	%00000010	; 184 : 02
 dc.b	%00000011	; 186 : 03
 dc.b	%00000100	; 188 : 04
 dc.b	%00000101	; 18a : 05
 dc.b	%00000110	; 18c : 06
 dc.b	%00000111	; 18e : 07
 dc.b	%00001000	; 190 : 08
 dc.b	%00001001	; 192 : 09
 dc.b	%00001010	; 194 : 10
 dc.b	%00001011	; 196 : 11
 dc.b	%00001100	; 198 : 12
 dc.b	%00001101	; 19a : 13
 dc.b	%00001110	; 19c : 14
 dc.b	%00001111	; 19e : 15
 dc.b	%00010000	; 1a0 : 16
 dc.b	%00010001	; 1a2 : 17
 dc.b	%00010010	; 1a4 : 18
 dc.b	%00010011	; 1a6 : 19
 dc.b	%00010100	; 1a8 : 20
 dc.b	%00010101	; 1aa : 21
 dc.b	%00010110	; 1ac : 22
 dc.b	%00010111	; 1ae : 23
 dc.b	%00011000	; 1b0 : 24
 dc.b	%00011001	; 1b2 : 25
 dc.b	%00011010	; 1b4 : 26
 dc.b	%00011011	; 1b6 : 27
 dc.b	%00011100	; 1b8 : 28
 dc.b	%00011101	; 1ba : 29
 dc.b	%00011110	; 1bc : 30
 dc.b	%00011111	; 1be : 31
 _drpColour:
 dc.b	0
 cnop	0,4
 drawRealPixel:
 ; drawing plane => a0
 ; X 		=> d0
 ; Y 		=> d1
 ; colour offset => d3
 ; SCRAP 	=> d2-d5
 
 movem.l	d0-d7/a0/a1,-(a7)
 lea	_drpColours(pc),a1
 move.l	_drpPlaneSize(pc),d6
 
 moveq	#0,d4
 tst.b	d3		; is there a colour?
 bne.s	.ok
 movem.l	(a7)+,d0-d7/a0/a1
 rts
 .ok:
 move.b	(a1,d3),d4	; colour byte
 moveq	#0,d5		; first bit in colour byte
 ; move to position
 move.w	d1,d2		; copy y
 lsl.w	#4,d1		; multiply by 40
 lsl.w	#3,d2
 add.w	d1,d1
 add.w	d2,d1
 add.w	d1,a0		; y pos first
 ; manage the X position
 move.w	d0,d2		; copy x
 lsr.w	#3,d0		; divide by 8 to get the hardposition
 and.w	#$000f,d2	; mask the 4 lower bits for 0-15 softposition
 btst	#0,d0		; is the hardposition an odd value ?
 beq.s	.even		; nope ...skip the -1 action
 subq.w	#1,d0		; it's odd ...sub 1 to keep even steps (A500!)
 .even:
 sub.w	#word,d2
 neg.w	d2
 add.w	d0,a0		; move to hardposition
 
 moveq	#planes-1,d7
 .drawlp:
 move.w	(a0),d3		; take current screendata word sized
 btst	d5,d4		; bit 0 or 1 (i.e. clear or draw)
 beq.s	.clear
 bset	d2,d3		; colour bit is lit -> set it lit in the data
 bra.s	.cont
 .clear:
 bclr	d2,d3		; colour bit is clr -> clear it in the data
 .cont:
 move.w	d3,(a0)		; write back the modified data
 addq.b	#1,d5		; prepare next colour bit to test
 add.l	d6,a0		; jump to next plane
 dbf	d7,.drawlp
 .end:
 movem.l	(a7)+,d0-d7/a0/a1
 rts
 
 
 |  
	| sp_ Member
 | 
		Azure's example replaces a matrix multiplication and perspective transformation with a set of small lookuptables.
 9 multiplications and 2 divisions per pixel are removed with dynamic programming.
 
 In coding theory, matrix approximations have proven not to give the optimal codes. Dynamic programming might...
 
 There is a faster way to solve this problem on the a500. If I ever finish my a500 demo I will show you. ;)
 |  
	| Azure Member
 | 
		sp:
 On A500 you can simply hardcode all offsets to the lookup tables and completely unroll the loop. Graham has done something like this on C64 long long ago...
 |  |  |