|
Author |
Message |
z5_
Member |
Makes perfect sense once you know it. Goes into the tips and tricks thread as well.
|
z5_
Member |
Another general question: in my example above, i was drawing 64 squares, which means 64 loops. In each loop i draw 10 (as an example) lines of 10 pixels, which amounts to 10*10*64 loop iterations.
As far as i know, this would have been an ideal task for the blitter with its "fill area" capacity (correct me if i'm wrong) but i'm not using the blitter, just the cpu.
Is there any way i can actually reduce the amount of loops? It seems quite silly to draw 10 pixels with 10 move.b instructions when each pixel has the same color. I wondered if it would make sense to preload the color into a longword. Keep in mind that ideally, the square size should be changeable (otherwise i would need a different routine for fixed size/adjustable size).
Also, if somebody asks: how many squares can you draw in a c2p double buffering environment, how is one to guess? Or is it trial and error to determine this.
In general, i'm not looking for asm code but some insights/pointers/tips on this would be helpful.
|
ZEROblue
Member |
The blitter's area fill mode works on planar (bitplane) layouts and can't be used for this in a chunky layout. The blitter also can't access fast memory (it only sees chip memory), which rules out blitter copy operations as well, since the chunky buffer is ideally stored in fast memory.
It does make sense to optimize by reducing the number of instructions and memory accesses, and there are several ways of doing this. The first one that comes to my mind is to generate blocks of code (or just write them by hand), one block for each line width you will need, each of which fills one line using as few instructions as possible and no looping. Of course you could just write a single general routine which moves one or more longwords at a time where possible, but this would call for more code and program logic and would probably not be as fast.
If I remember correctly, word sized or larger accesses to odd memory addresses incur extra cycles, and should be taken into consideration as well.
There's no general answer to the question of how many squares you can draw (in 1/50th of a second?) as it depends on several factors, the size of the squares foremost. Trial and error will only tell you the limit on that very system, using that very code.
|
winden
Member |
It seems quite silly to draw 10 pixels with 10 move.b instructions when each pixel has the same color. I wondered if it would make sense to preload the color into a longword. Keep in mind that ideally, the square size should be changeable (otherwise i would need a different routine for fixed size/adjustable size).
Actually, you are on the good path to optimising :)
If you know that you are writing more than 4 pixels, then it makes sense to preload the color into a longword and write 4 of them at the same time.
About the fixed/adjustable size...
If you know that you are only doing 10px long lines, then it makes perfect sense to code a program that only does that.
If you know that you do 10px long lines 50% of the time, and some other variable size the remaining 50% of the time, then it makes sense to have 2 separate routines, one for 10px and another for variable sizes.
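A minimal sketch of the fixed-width case, assuming a 10-pixel line, the colour already replicated into all four bytes of d4, and a0 pointing at the first pixel. The register choice is only an illustration. On a 68000 the address in a0 would have to be even; on 68020 and later an odd address works but costs extra cycles.

```asm
; d4.l = colour byte replicated four times (e.g. $05050505)
; a0   = address of the first pixel of the line
        move.l  d4,(a0)+        ; pixels 0-3
        move.l  d4,(a0)+        ; pixels 4-7
        move.w  d4,(a0)+        ; pixels 8-9
```

Three writes replace ten move.b instructions; a variable-width routine needs extra logic for the leftover bytes at either end.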
|
z5_
Member |
Good tips here. "Hardcoding the pixel loop" (thus writing 10 times move.b instead of 10 loops) reduces the amount of loops considerably but leaves no room for different sizes. Each square has the same maximum size but the size can range from 0 to max size. I was thinking of doing something with preloading longwords but can't really find anything that fits into the different size idea.
Is there a trick to move a byte value into a longword 4 times (for example $05 => $05050505) other than using shift + AND?
|
Kalms
Member |
assuming that the upper 24 bits of d0 are cleared, then:
... expands in two cycles on the 060. It's considerably slower on older CPUs :)
Other than that, well:
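move.b d0,d1 ; d1.b = cc (keep a copy of the byte)
lsl.w #8,d0 ; d0.w = cc00
move.b d1,d0 ; d0.w = cccc
move.w d0,d1 ; d1.w = cccc
swap d0 ; d0.l = ccccxxxx
move.w d1,d0 ; d0.l = cccccccc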
... does the job too, without requiring the upper 24 bits of d0 to be cleared beforehand.
You can also use a lookup table.
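The first snippet in the post above appears to be missing. One instruction that fits the description (upper 24 bits of d0 clear, fast on the 060, much slower on older CPUs) is a long multiply by $01010101. This is an assumption about what was meant, not a recovered quote, and it needs a 68020 or later.

```asm
; d0.l = $000000cc (colour byte, upper 24 bits must be clear)
        mulu.l  #$01010101,d0   ; d0.l = $cccccccc
```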
|
z5_
Member |
Another question. Suppose i use 320*200 using Kalms parameters for the screen setup (see before in this thread).
I want to set up a copper list for a gradient background. Through testing, i have established that waiting for line 72 (dc.w $4801,$ff00) corresponds with the first line of the 320*200 screen. If i change my background color at this place, the change happens at the same height as the top left corner of my 320*200 screen. This would mean that for my last line, i would need 72+200, which is more than the maximum of 255 lines allowed (dc.w $fe01,$ff00).
Am i missing something?
|
Kalms
Member |
@z5: You need *two* waits when you want to cross line 256.
First, wait for a position very close to the end of line 255: dc.w $ffe1,$fffe
Then, wait for the real position (modulo 256), for instance, line 266: dc.w $0a01,$ff00
The reason why this works is that once the copper finishes waiting for position $ffe0, the beam position counter will wrap around to $0000 before another copper instruction is executed.
Ugly? Certainly.
Hmm. It might be that the first wait should be to location $ffdf, not $ffe1. I don't know offhand. One of the two should work though.
|
movew
Member |
Intro / Making A500/OCS code compatible to 060/AGA
Hello!
I'm movew and this is my first post on ADA :)
I have been watching demos since the early 90's and became more aware of the Amiga scene through a friend of mine (you may know him as Antibyte/SCX).
-
Coding on Amiga is quite new to me, and I want to make things right. So here's my question:
I focus on stock A500 OCS and can test my code on a real A500. How can I improve the 060/AGA compatibility of my production? Currently I test that with various settings of WinUAE and E-UAE to see what a state-of-the-art Amiga might do.
For example, I set BPLCON3 to $c00 and FMODE to 0 in my copperlist, which finally resulted in a correct screen width on E-UAE with AGA emulation.
The goal is to make my upcoming intro viewable on your typical AGA machine :)
Thanks!
(you might know me as movel but I changed that to movew now :b)
|
Blueberry
Member |
Hi movew, and welcome to the world of Amiga coding! :)
You are on the right track with BPLCON3 and FMODE. In addition, you should also set BPLCON4 ($dff10c) to $0011 to get OCS-compatible sprite color offsets.
What kind of system-shutdown code do you use? You should make sure it saves the old ActiView from GfxBase, calls LoadView(0) on startup and then calls LoadView with the old view upon exiting. Also, to make sure your demo runs in PAL even on NTSC Amigas, set BEAMCON0 ($dff1dc) to $0020 right after LoadView(0).
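A rough sketch of that startup/shutdown sequence, under several assumptions: the library vector offsets and GfxBase field offsets are written out by hand instead of coming from the system includes, the labels (Startup, Shutdown, GfxBase, OldView) are made up for the example, and the two WaitTOF calls after each LoadView are a common habit rather than something stated in the post. Saving DMACON/INTENA and calling Forbid are left out.

```asm
_LVOOpenLibrary  equ -552               ; exec.library
_LVOCloseLibrary equ -414
_LVOLoadView     equ -222               ; graphics.library
_LVOWaitTOF      equ -270
gb_ActiView      equ 34                 ; GfxBase fields
gb_copinit       equ 38

Startup:
        move.l  4.w,a6                  ; ExecBase
        lea     GfxName(pc),a1
        moveq   #0,d0                   ; any version
        jsr     _LVOOpenLibrary(a6)
        move.l  d0,GfxBase
        beq.s   .fail
        move.l  d0,a6
        move.l  gb_ActiView(a6),OldView ; remember the active view
        sub.l   a1,a1
        jsr     _LVOLoadView(a6)        ; LoadView(0)
        jsr     _LVOWaitTOF(a6)
        jsr     _LVOWaitTOF(a6)
        lea     $dff000,a5
        move.w  #$0020,$1dc(a5)         ; BEAMCON0: force PAL
        move.w  #$0c00,$106(a5)         ; BPLCON3
        move.w  #$0000,$1fc(a5)         ; FMODE
        move.w  #$0011,$10c(a5)         ; BPLCON4: OCS sprite colours
        moveq   #0,d0                   ; d0 = 0: ok
        rts
.fail:  moveq   #-1,d0
        rts

Shutdown:
        move.l  GfxBase,a6
        move.l  OldView,a1
        jsr     _LVOLoadView(a6)        ; restore the old view
        jsr     _LVOWaitTOF(a6)
        jsr     _LVOWaitTOF(a6)
        move.l  gb_copinit(a6),$dff080  ; COP1LC = system copper list
        move.l  a6,a1
        move.l  4.w,a6
        jsr     _LVOCloseLibrary(a6)
        rts

GfxName: dc.b   'graphics.library',0
        even
GfxBase: dc.l   0
OldView: dc.l   0
```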
|
Blueberry
Member |
@z5: When you are making gradient backgrounds, you should not wait for horizontal position $01, since that will change the color before the end of the previous line, resulting in some visual glitches in the right border area. Position $07 works fine.
As for crossing line 256, $ffdf is indeed the right position to wait for. Actually, if the horizontal position you wait for on each line is $df, you can ignore the line 256 issue altogether. As far as I remember, waiting for $df will actually trigger during horizontal blanking, so it should work fine for gradients, and it avoids a really cumbersome special case.
It is a bit strange that waiting for $df triggers later than waiting for $01 of the next line. Can anyone confirm (on a real Amiga) that I remember correctly here?
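To make the two-wait approach concrete, here is a hypothetical copper list fragment for a gradient that crosses line 255. The colour values are placeholders, the waits use horizontal position $07 as suggested above, and the extra wait at $ffdf is the one that lets the vertical counter wrap before the next wait is evaluated.

```asm
; Copper list data: WAIT = dc.w (line<<8)|hpos|1,$fffe ; MOVE = dc.w register,value
        dc.w    $fe07,$fffe     ; wait for line 254
        dc.w    $0180,$0009     ; COLOR00 (placeholder colour)
        dc.w    $ff07,$fffe     ; wait for line 255
        dc.w    $0180,$000a
        dc.w    $ffdf,$fffe     ; wait for the end of line 255, counter wraps to $00
        dc.w    $0007,$fffe     ; wait for line 256 (256 modulo 256 = $00)
        dc.w    $0180,$000b
        dc.w    $0107,$fffe     ; wait for line 257
        dc.w    $0180,$000c
        dc.w    $ffff,$fffe     ; end of copper list
```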
|
Blueberry
Member |
"Hardcoding the pixel loop" (thus writing 10 times move.b instead of 10 loops) reduces the amount of loops considerably but leaves no room for different sizes.
Unfolded loops can also have dynamic length, by using a computed address jump. So instead of
bra.b lend
loop: move.b d4,(a0)+
lend: dbf d7,loop
you can have
neg.w d7 ; d7 = -(number of pixels to write)
jmp end(pc,d7.w*2) ; each move.b is 2 bytes long; scaled index needs 68020+
rept MAX_WIDTH
move.b d4,(a0)+
endr
end:
Here, MAX_WIDTH can be at most 63 for the end label to be within the 8-bit range of the pc-relative addressing. If you need more, you can just lea the label into an address register and use that as the base for the jump instead of the pc.
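A sketch of the address-register variant mentioned above, written so that it also assembles for a plain 68000 (no scaled index). The routine name, the MAX_WIDTH value and the register assignments are assumptions made for the example.

```asm
MAX_WIDTH equ 100

; d7.w = number of pixels to write (0..MAX_WIDTH), trashed
; d4.b = colour, a0 = destination, a1 = scratch
FillRow:
        lea     .end(pc),a1     ; base = address just past the unrolled block
        add.w   d7,d7           ; each move.b d4,(a0)+ is 2 bytes long
        neg.w   d7              ; negative byte offset back from .end
        jmp     0(a1,d7.w)      ; enter the block exactly d7 moves before .end
        rept    MAX_WIDTH
        move.b  d4,(a0)+
        endr
.end:   rts
```

With d7 = 0 the jump lands directly on .end and nothing is written.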
|
d0DgE
Member |
Blueberry: I'll try that in due time (gotta clean up the loft after living one week in front of the screen :) after reinstalling the miggy.
After all I want to dig deeper into copper position issues anyway...
|
z5_
Member |
The gradient copper stuff worked. There's one thing i spent ages trying to figure out. Apparently, you need to write the msb (bplcon3: $000) of the color value first, then the lsb (bplcon3: $0200). Otherwise, the lsb is overwritten.
I think i wrote the answer to this question in my own coding tutorial about colors here on A.D.A: if you write the msb value, the lsb value is automatically overwritten with the same value as msb, to maintain backward compatibility. So first write msb, then write lsb.
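As an illustration of that write order, a copper fragment that sets COLOR00 to the 24-bit value $1a2b3c might look like this. The colour is arbitrary, and the other BPLCON3 bits ($0c00) are simply taken from the value mentioned earlier in the thread; only bit 9 ($0200) selects the low nibbles.

```asm
        dc.w    $0106,$0c00     ; BPLCON3: bit 9 clear, writes go to the high nibbles
        dc.w    $0180,$0123     ; COLOR00 high nibbles of $1a2b3c (low nibbles get a copy)
        dc.w    $0106,$0e00     ; BPLCON3: bit 9 set, writes go to the low nibbles
        dc.w    $0180,$0abc     ; COLOR00 low nibbles of $1a2b3c
        dc.w    $0106,$0c00     ; back to high-nibble mode for later colour writes
```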
|
z5_
Member |
@blueberry: very nice tip with the unfolded loops having a dynamic size. Exactly what i was looking for.
|
ZEROblue
Member |
Hi Movew, welcome to the ADA
You should also disable all caches to guarantee memory coherency, which avoids situations you would otherwise only have to take into account on the higher CPUs. Also, don't make any assumptions about how many rasterlines a piece of code will take to execute, so that all raster timing done with the CPU is fail proof.
A piece of code measuring up to 10 rasterlines on a stock A500 might very well execute in less than 1 rasterline on another system given the right CPU and memory configuration.
F.ex rastertiming like this is not fail proof ...
Loop: cmp.b #$ff, $dff006
bne.s Loop
<short piece of code>
btst #14, $bfe001
bne.s Loop
... until it has been "padded" by f.ex waiting on $fe before $ff or having another fixed wait somewhere on the screen.
I've never programmed anything above 020 so I can't tell you how to push, disable and invalidate the caches on 030 and above without walking on the plank so to speak, but the CacheControl and related calls in Exec will most likely let you exercise all control you need in a transparent manner.
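One way to write the "padded" wait described above is sketched here: wait for line $fe first, then for line $ff, so that a very fast CPU cannot run the loop body twice within the same rasterline. Effect is a hypothetical placeholder routine.

```asm
MainLoop:
.wait1: cmp.b   #$fe,$dff006    ; first wait for line $fe ...
        bne.s   .wait1
.wait2: cmp.b   #$ff,$dff006    ; ... then for line $ff
        bne.s   .wait2
        bsr.s   Effect          ; per-frame work, however short
        btst    #6,$bfe001      ; left mouse button (bit is 0 while pressed)
        bne.s   MainLoop
        rts

Effect: rts                     ; placeholder for the real routine
```

If Effect returns while the beam is still on line $ff, the first wait holds the loop until line $fe of the next frame.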
|
Kalms
Member |
blueberry: I tested with WAITs at offset $df on my machine just now.
Example copperlist snippet:
dc.w <offset>,$fffe
REPT 30
dc.w $180,$fff
ENDR
dc.w $180,$000
I can't see the right border on my monitor, therefore I'm making a white line which extends well onto the next scanline on-screen, and observing where the white line ends.
The result I got was that waiting for offset $df was not special in any way. It would finish slightly after waiting for offset $dd, and slightly before waiting for $e1 (or $01 on the next line).
When using the following code to wait for line 260:
dc.w <offset>,$fffe
dc.w $0601,$fffe
dc.w $180,$fff
... then using offset $ffdd, $ffdf and $ffe1 worked. Lower values would stop the wait too soon, and higher values would wait forever (because the VHPOSR register never goes higher than $ffe1).
Not sure if the above is of any help to you.
|
movew
Member |
Thank you for your replies!
@Blueberry: The shutdown code I used until now restored DMACON and the previous copperlist using graphics.library. Concerning LoadView, I found a nice resource here:
Copper Programming
I will also study their startup- and shutdown code:
startup.asm
Thanks! This helped a lot, proper startup/shutdown is a must!
@ZEROblue: yes, caches should not be messed with. Forgive me, but what exactly is the effect of:
btst #14, $bfe001
I will immerse myself in the hardware manual until I figure it all out.
|
ZEROblue
Member |
The btst #14, $bfe001 is a bad habit I can't seem to lose. I picked it up when I started programming the Amiga from who knows where and it has stuck with me since.
When using btst on an effective address as destination operand, the test will only be done on a byte, so the bit number you're testing will be in the range 0-7, and since 14 modulo 8 = 6 this btst will test the status of the left mouse button as intended.
Good thing you pointed it out 'cause I've stopped thinking about it myself since long ago :)
|
d0DgE
Member |
HAI folks,
as expected Hitchhikr stumbled over my last release and complained that it won't work on the ocs/ecs machines (leaving out the 1st ecs miggy ever, but that's nitpicking...fact: he's right).
So I resurrected the 500 and made it work. But I came across one very strange (or very logical) incident:
for the little ballcluster I use or.l and eor.l on (a0) to imitate the cookie cut effect
from the blitter...but the 500 seems to accept only byte size in that fashion...it's like that...
ballclust:
;get balldata in a1
;chipmem(screen) in a0
;one ball is 32x32
moveq #32-1,d7
.lp:
move.l (a1)+,d0
;then I do some shifting and byte or/eor for the softscroll on 4(a0)
;the scroll-overflow data is in d1 then
or.b d1,4(a0)
;test if eor required
eor.b d1,4(a0)
;after that I try to or/eor the main data
or.l d0,(a0)
;some testing if to do an eor
eor.l d0,(a0)
add.l #40,a0 ; next line
dbf d7,.lp
rts
I'm afraid the 68000 has some restrictions regarding long operations on chipmem, right? However the AsmOne (set to 68000 strict) doesn't complain
about that o_O ... the 68k manual didn't give me that much insight on the matter.
I tried it word sized with swapping d0 but it didn't help :/
|
Kalms
Member |
On 68000, word and longword memory accesses must be word-aligned. If you do .w or .l access to an odd address, you get an address error exception on that CPU. Is that what you're seeing?
|
d0DgE
Member |
hmm...what I see is nothing. The operation is not executed and the program halted...replayer continues (CIA interupt), no guru whatsoever.
The odd address issue is most plausible because I move the thing in byte steps over the screen (and right shift the softscroll values 1-7 bits before next whole byte). So I'd better try to fix the routine for 16 bit steps, right ?
Thanks a lot ... now I have a foot hole to start from :)
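A possible shape for the 16-bit-step version, assuming the x position is split into a word index and a 0..15 bit shift, so that every .l and .w access lands on an even address. The register assignments are hypothetical and a0 is assumed to point at the (even) start of the screen line.

```asm
; d0.l = 32 pixels of one ball row, d2.w = x position in pixels
; a0   = start of the screen line (even address)
        move.w  d2,d3
        lsr.w   #4,d3           ; x / 16 = word index
        add.w   d3,d3           ; word index -> even byte offset
        and.w   #15,d2          ; remaining shift, 0..15 bits
        move.l  d0,d1
        swap    d1              ; low word of the data into the high half
        clr.w   d1
        lsr.l   d2,d1           ; d1.w = the bits pushed out past 32 pixels
        lsr.l   d2,d0           ; d0.l = first 32 pixels, shifted
        or.l    d0,0(a0,d3.w)   ; even address, safe on 68000
        or.w    d1,4(a0,d3.w)   ; overflow word right after it
```

The eor pass would use the same two addresses with eor.l and eor.w.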
|
z5_
Member |
A question i have posted before but that i deleted because i think it's a dumb question but anyway. I'm always running in trouble when using byte sized operations so...
A byte can be either signed (-128 to 127) or unsigned (0 to 255). In comparisons, you determine which one you want by the type of comparison instruction used. However, how do i determine what i want in for example a simple add instruction? Is add.b unsigned or signed?
And more general, is there still a point in using byte instructions, speed wise, let's say starting from an A1200?
|
dalton
Member |
add.b is both signed and unsigned... it's just a question of how YOU look at the numbers...
for instance, 100+150 equals 250 which is true, and if you think of it as a signed operation, which means 150 is -106 and 250 is -6, that is also correct (100-106=-6)
magic!
|
Kalms
Member |
z5: add/sub work both for signed and unsigned operand-pairs.
If you want both inputs, and the output, to be unsigned, then add.b will function as you expect it to. Similarly if you want the inputs and the outputs to be signed.
If you want to take the difference of two byte-sized values, each in the range 0..255 (so they are unsigned bytes), then you will be fine if the first value is always smaller than the second value. The result will then also be in the range 0..255 (can be correctly represented as an unsigned byte.)
However, if the two values are arbitrary, then the result will be in the range -255..+255 and will not always fit into a byte. "The result does not fit into a byte" means that some results will be aliased on top of each other; that is, 224-16=208 and 32-80=-48, but both values will be stored as $d0 in the result register. You need to look at the input values to figure out whether 208 or -48 was the intended result.
The easiest way to handle the above situation is to expand both input operands to unsigned words (make sure they are on the $00xx form), and perform a word subtraction. If you interpret both the input operands and the output operand as a signed word, you will get the result you expected.
Performance: add/sub/and/or/eor/not/neg against a register execute at the same speed for .b .w and .l operations on 68020+. However, if you can pack your data more tightly in memory, you need to perform fewer memory accesses to read/write it, and that helps with performance. Also, sometimes you can perform operations on 4 bytes or 2 words in parallel using .l operations if you are careful, and that gives an immediate speed boost.
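The "expand to unsigned words, then subtract" approach could be sketched like this, assuming the two unsigned bytes are read from (a0) and (a1); those pointers are only for illustration.

```asm
        moveq   #0,d0
        move.b  (a0),d0         ; first value as $00xx (0..255)
        moveq   #0,d1
        move.b  (a1),d1         ; second value as $00xx (0..255)
        sub.w   d1,d0           ; d0.w = first - second, -255..+255 as a signed word
```

With the inputs from the example above, 224-16 leaves $00d0 (208) in d0.w and 32-80 leaves $ffd0 (-48), so the two results no longer alias.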
|
z5_
Member |
What i actually wanted to do was to add a byte to a word.
moveq #0,d0
move.b (a0),d0 ; can be negative or positive, but between -20 and +20
add.b d0,d2 ; d2 can be negative or positive and any "word" value
The table where a0 points to was word-sized but i figured it would be good coding practise to make it byte sized, as i don't need values less than -128 and more than 127 in it anyway.
I thought about expanding the byte to word first using sign extend but it still doesn't seem to work:
move.b (a0),d0
ext.w d0
add.w d0,d2
|
Kalms
Member |
If d2 contains a word-sized value then you must use add.w when adding anything to it, because add.b will only affect the lowest byte of d2.
So you want to fetch a value from the table, convert it from a signed byte to a signed word, and then add that to d2.
Your second code example does exactly this, and according to your specification it should work.
|
Blueberry
Member |
Sorry, replied to some ancient post... ignore it...
|
z5_
Member |
I'm at a point where i feel i really could do with some help. I've got something simple coded but i'm seeing glitches on screen which aren't there when i pause the screen (glitches from movement). The same glitches i had when using wickedos. On top of that, i have one more important issue to resolve.
So if anybody is prepared to have a look at my code, that would be great. I would not ask if i hadn't got the feeling that i'm going in the wrong direction in some very important and basic aspects.
|
movew
Member |
z5: Hmm, do you mean tearing, i.e. a horizontal offset introduced at some scanline? I suppose so. Here are some thoughts:
- do you have a true 50 Hz screen (for example, a real Amiga and no emulator)? [is the refresh rate of your monitor an integer multiple of 50 Hz (PAL Amiga) or 60 Hz (NTSC Amiga)?]
- do you utilize vertical sync?
Cheers and happy debugging!
|