A.D.A. Amiga Demoscene Archive

Demos Amiga Demoscene Archive Forum / Coding / coding tutorial: general questions
z5_
Member
#1 - Posted: 28 Aug 2007 12:03
I've got one general "good coding practice" question. In the above example, I noticed this:
rept 256
movem.l
endr

I know that this can reduce the number of loop iterations considerably (thus making the code faster), with bigger code size and executable as the trade-off. In general, where does one draw the line? Is it common practice to unroll this much?
Kalms
Member
#2 - Posted: 28 Aug 2007 13:16 - Edited
For A500, you'll unroll a lot. Unrolling on 68000 can only improve performance. The limiting factor is how much memory the loop will consume (can you still fit your entire program into memory?). The example above will consume 1kB of memory, and is about 10% faster than having a single MOVEM inside a DBF loop.

For 68020/030, you have a 256-byte instruction cache. Instruction reads are costly, so it is probably better to have a single MOVEM.L inside a DBF loop than to have the loop fully unrolled. The optimal solution is to unroll somewhere between 1 and 60 times, because:
1) The loop must fit into the instruction cache, otherwise there will be lots of unnecessary memory accesses
2) Unrolling more will decrease the number of loop control (DBF) instructions that you execute
3) Unrolling more will cause more instruction fetches (when filling the cache)

I used to unroll critical loops 4 to 16 times on 68030.

For 68060, you have an 8kB instruction cache. Loop control is quick, and memory is way slower than the processor. Avoiding instruction fetches is important. Therefore, unrolling the loop in the above example will not gain much speed (a percent or so, in the best case). You are more likely to lose speed overall: even though the unrolled loop fits into the instruction cache, loading the unrolled loop into the cache will push out 1kB of unrelated code. The next time that piece of code needs to execute, it needs to be re-loaded.
Therefore, unrolling more than 4-16 times is usually counterproductive on 060.

When I unroll something on 060, I unroll it 4-16 times. I only do that for loops which are really short (the code inside will execute in <10 cycles, not counting memory accesses) and will do many (256+) iterations.
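A rough way to put numbers on the code-size side of this trade-off is sketched below. It is only an illustration: it assumes each MOVEM.L in the loop body is 4 bytes (which matches the 1kB quoted for 256 copies) and that the DBF is 4 bytes, the helper names are made up, and it ignores cache-line alignment and actual timing.

```python
# Sketch: code size and loop-control count for a partially unrolled loop.
# Assumed sizes (not measured): 4 bytes per MOVEM.L, 4 bytes for the DBF.
MOVEM_BYTES = 4
DBF_BYTES = 4


def loop_bytes(unroll):
    """Bytes of code for `unroll` MOVEM.L instructions followed by one DBF."""
    return unroll * MOVEM_BYTES + DBF_BYTES


def max_unroll(cache_bytes):
    """Largest unroll factor whose loop still fits in the instruction cache."""
    return (cache_bytes - DBF_BYTES) // MOVEM_BYTES


def dbf_count(total_movems, unroll):
    """How many DBF instructions execute to get through total_movems MOVEMs."""
    return -(-total_movems // unroll)  # ceiling division


if __name__ == "__main__":
    print("fully unrolled body:", 256 * MOVEM_BYTES, "bytes")
    print("max unroll in a 256-byte cache:", max_unroll(256))
    for n in (1, 4, 16, 60):
        print(n, "x unrolled:", loop_bytes(n), "bytes,", dbf_count(256, n), "DBFs")
```

Under these assumptions the 256-byte cache of the 68020/030 holds a little over 60 MOVEMs plus the DBF, and going from 4x to 16x unrolling already removes most of the loop-control instructions.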
korruptor
Member
#3 - Posted: 30 Aug 2007 17:23 - Edited
Thanks for all your replies, sorry for not getting back to you earlier - I was on holiday.

I'll have a play with mixing and matching, as I'll definitely want more blitter time than CPU. Although I'll probably target just a normal A500 in which case the blitter sounds best.

Thanks again, I love reading the info on this forum :D
michael phipps
Member
#4 - Posted: 31 Aug 2007 16:44
@Korruptor

Did you check out my nice little tutorial on using the BLITTER for clearing m8! Check it out dude!!!!
korruptor
Member
#5 - Posted: 2 Sep 2007 13:02
Yes mate - much appreciated! :D
StingRay
Member
#6 - Posted: 6 Sep 2007 22:24 - Edited
Quote:
"@Korruptor

Did you check out my nice little tutorial on using the BLITTER for clearing m8! Check it out dude!!!!"

Errm dear Michael, where is the blitter wait in your clear loop? You shouldn't write code that works on 68000 only, especially not when it's code that's meant to teach people how to use the blitter.

Also, why do you reload the bltcon0+1 registers in the loop? They don't change so it can be done outside the loop! And why do you use the aptr? That's totally unnecessary! And errm, where do you set the modulos? Could not see that anywhere. And why 20*256+40 to set the blitsize? Here's how I would do it:


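; assumed: a6 = $dff000 (custom chip base), a0 = first bitplane to clear
; register names as in hardware/custom.i
        tst.b   $2(a6)                  ; dummy read of DMACONR before polling
.wblit  btst    #6,$2(a6)               ; blitter still busy?
        bne.b   .wblit
; preload blitter regs
        move.w  #$0100,bltcon0(a6)      ; D channel only, minterm 0 = clear
        move.w  #$0000,bltcon1(a6)
        move.w  #$ffff,bltafwm(a6)
        move.w  #$ffff,bltalwm(a6)
        move.w  #0,bltdmod(a6)          ; important!
        move.w  #2-1,d7                 ; clear two bitplanes (-1 for the dbf counter)
blit_clear_loop:
        tst.b   $2(a6)
.wblit  btst    #6,$2(a6)
        bne.b   .wblit
        move.l  a0,bltdpt(a6)
        move.w  #(256<<6)+(40/2),bltsize(a6)    ; 256 lines, 20 words per line
        lea     40*256(a0),a0           ; point to the next bitplane
        dbf     d7,blit_clear_loop
        rts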
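To see why the blit size is written as (256<<6)+(40/2), here is a small sketch that packs and unpacks a BLTSIZE value. It assumes the OCS/ECS layout (height in lines in bits 15-6, width in words in bits 5-0); the function names are made up for illustration, and the special case where a zero field means the maximum size is left out.

```python
# Sketch: packing and unpacking a BLTSIZE value.
# Assumed layout: bits 15-6 = height in lines, bits 5-0 = width in words.


def bltsize(height_lines, width_words):
    """Pack a blit size; this sketch handles 1..1023 lines and 1..63 words."""
    if not (1 <= height_lines <= 1023 and 1 <= width_words <= 63):
        raise ValueError("size outside the range handled by this sketch")
    return (height_lines << 6) | width_words


def decode_bltsize(value):
    """Return (height_lines, width_words) for a packed value."""
    return (value >> 6) & 0x3FF, value & 0x3F


if __name__ == "__main__":
    # 256 lines of a 40-byte-wide bitplane = 20 words per line
    print(hex(bltsize(256, 40 // 2)))
    # what the value 20*256+40 would mean to the blitter
    print(decode_bltsize(20 * 256 + 40))
```

Decoding 20*256+40 this way gives a height and width that do not describe a 256-line, 40-byte-wide bitplane, which is why that expression looks suspicious.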
michael phipps
Member
#7 - Posted: 7 Sep 2007 09:51
@ StingRay

Yeah!!! Welcome back m8! Finally there's someone here that knows exactly what he's doing!

So I've got some serious to do, right?! Okay! First of all, a blitter wait is not necessary if you clear small chunks of memory on an A500, and I know you can't get away with it on higher configs (i.e. A1200). Also, blitter modulos are not needed if you're just doing screen clearing. Next, your blitsize calculations are different from mine but still give the same result (I think?). I'm gonna have to check when my A1200, which I purchased off eBay, arrives! Please forgive me, as it's been 10 years since I last coded anything on the Amiga and I'm only starting to remember some of these things!
michael phipps
Member
#8 - Posted: 7 Sep 2007 09:56
oops.. better add SERIOUS EXPLAINING. Sorry my memory is bad at the mo hehehe..
winden
Member
#9 - Posted: 8 Sep 2007 12:04
@mp:

destination modulo is always needed no matter what you do, because the last process that used the blitter could have set it to a value which is not zero, just like setting screen bitplanes modulo to the value you need when doing a copperlist.

also, given there are both experienced coders and newcomers to amiga coding on this forum, if we know something will not work on higher machines then i think we should always go for the safe route and not use the a500-only trick, or at least add a disclaimer that it does not apply to higher machines.

i think many newcomers are going to learn coding on the same high-end machine they use to watch demos, or maybe even on UAE, so any tips that only work on a500 will probably not work for them.
z5_
Member
#10 - Posted: 8 Sep 2007 18:33 - Edited
Is there any difference, speed wise, between:

add.w #4,(a2)
cmp.w #5,(a2)

move.w (a2),d2
add.w #4,d2
cmp.w #5,d2

Or put another way, is there any point in moving the value that a2 points to into a data register before doing calculations? I assume it is faster, but again, is it worth the extra move instructions for simple calculations?
winden
Member
#11 - Posted: 9 Sep 2007 01:48 - Edited
yes, it's "always" the good way to do it. the first form is doing this internally:

add #4,(a2) ---> read from memory, add, write to memory
cmp #5,(a2) ---> read from memory, compare


the other form is:

move.w (a2),d2 ---> read from memory
add    #4,d2   ---> add
cmp #5,d2      ---> compare



so the second form is usually less expensive, but you should keep in mind that the second form is not leaving the modified value in memory, which would need this:

move.w (a2),d2 ---> read from memory
add    #4,d2   ---> add
move.w d2,(a2) ---> write to memory
cmp    #5,d2   ---> compare
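The difference between the three sequences can be made concrete by counting data accesses. The sketch below is a toy model, not a timing simulation: the class and function names are made up, instruction fetches are not modelled, and the compare is reduced to a plain equality test.

```python
# Sketch: count data reads/writes for the three instruction sequences.
# Instruction fetches are deliberately not modelled.


class CountingMemory:
    """A single 16-bit memory word that counts how often it is accessed."""

    def __init__(self, value):
        self.value = value & 0xFFFF
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return self.value

    def write(self, value):
        self.writes += 1
        self.value = value & 0xFFFF


def memory_form(mem):
    """add.w #4,(a2) / cmp.w #5,(a2)"""
    mem.write(mem.read() + 4)   # read-modify-write on memory
    return mem.read() == 5      # the compare reads memory again


def register_form(mem):
    """move.w (a2),d2 / add.w #4,d2 / cmp.w #5,d2"""
    d2 = mem.read()
    d2 = (d2 + 4) & 0xFFFF
    return d2 == 5              # memory is NOT updated


def register_form_with_store(mem):
    """As register_form, plus move.w d2,(a2) to write the result back."""
    d2 = mem.read()
    d2 = (d2 + 4) & 0xFFFF
    mem.write(d2)
    return d2 == 5


if __name__ == "__main__":
    for form in (memory_form, register_form, register_form_with_store):
        mem = CountingMemory(1)
        form(mem)
        print(form.__name__, "reads:", mem.reads, "writes:", mem.writes)
```

The register version with the write-back touches memory once less than the in-memory version, and every further operation on the value widens that gap.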
z5_
Member
#12 - Posted: 9 Sep 2007 10:17
Quote:
"so the second form is usually less expensive, but you should keep in mind that the second form is not leaving the modified value in memory, which would need this"

Exactly, that was why I was wondering if there's a point in moving it into a d-register. You have two extra move instructions. I guess it's only worthwhile if you do several calculations on it?
Kalms
Member
#13 - Posted: 9 Sep 2007 12:46
Rule of thumb: If you do more than one operation on a value, fetch it into a register first.
michael phipps
Member
#14 - Posted: 9 Sep 2007 17:35
@winden

Okay... okay, point taken m8! I guess I'll have to stick to the rulez on here and give valid advice from now on, if that's what you want hehehe..

@stingray

I just got my A1200 today and checked the formula you were using to calculate the blit size, so I will quote your example:

move.w #(256<<6)+(40/2),bltsize(a6)

As I understand it, you are calculating 256 lines for the height plus 40/2 words across the screen, am I right??!

Okay, here is my example, coz I think I probably got it wrong in my previous examples, sorry!

move.w #256*64+20,bltsize(a6)

Here I have used *64 instead of the shift; I think you will find that it arrives at the same result!

@everyone!!!

Yeah!!! I just got my new A1200 machine yesterday and I'm ready to start some serious coding again, so I guess it's safe to say that the mighty Wizzball is back in 2007 LOL!!! Not to mention I dug out all my old DD disks, which I've kept for over 15 years! Oh dear, when I tried to run them, most of them were corrupted and I have lost a lot of my cool demo routines, but I managed to get some back by using disk salv utility and recovered about 45 disks out of 350! Isn't that terrible? Never mind... I've still got a lot of my old AGA hardware files and optimizations, which I'm gonna disclose to you people on here. Yeah!

Btw.. I need a bit of advice on putting up these files on here; perhaps z5_ can help me. I want to put some example routines up on your website. These routines will be ADF files and WILL work on WinUAE! Please help ASAP!
michael phipps
Member
#15 - Posted: 9 Sep 2007 18:09
@z5_

Quote:
"Is there any difference speed wise, between:"

My answer is yes, there is a difference, as I will explain. When you use an operand like (a0), you are using Address Register Indirect addressing, and this takes approx 4 CPU cycles to execute! Using long indirect addressing takes approx 8 CPU cycles to execute, but operands in the data registers (i.e. d0-d7) add next to no time! Here I will quote from your example:

add.w #4,(a2)
cmp.w #5,(a2)

move.w (a2),d2
add.w #4,d2
cmp.w #5,d2


Okay, here is my corrected example for you:

move.w (a2),d2 ; move memory value into d2!
addq.w #4,d2 ; quick add (4 cycles on a 68000)
cmp.w #5,d2

And if you want to store the new value afterwards, just do this:

move.w d2,(a2)
rts

Simple... GOT IT!!!
z5_
Member
#16 - Posted: 11 Sep 2007 11:08
For me, the most confusing thing about assembler is the question of when to use .b, .w or .l. Take an add instruction for example:

addq.w #5,a0

Can I, in this case, use .w because #5 fits into a word, or can I only use .w if I'm sure that the result of the addition (in this case the address in a0) will fit into a word?

The same question can be asked for a lot of operations. I have looked in the 68000 reference manual but haven't found a clear explanation. What logic should be applied? Is it the source or the destination (or both) that determines .b, .w or .l?
z5_
Member
#17 - Posted: 11 Sep 2007 11:23
Quote:
"So, this is probably due to some interaction between the demosystem's double/triple buffering and the C2P. Put up a complete example program (including demosystem) that exhibits this behaviour and we'll give it a look."

@kalms:
I have put an example in the Wos thread. While this question can probably be answered fairly easily by noname (as he knows the inner workings of wos), if he doesn't answer maybe you could have a look?
Kalms
Member
#18 - Posted: 11 Sep 2007 11:39
Generally, the .b / .w / .l operation determines the size of the *operation* but not necessarily the size of the *operands*.

Examples:
An add.w #1,d0 will perform a 16-bit addition, with one 16-bit immediate operand, and one 16-bit register operand (the lower half of d0; upper half will remain untouched).
An addq.w #1,d0 will perform a 16-bit addition, with one 3-bit immediate operand, and one 16-bit register operand.
An add.w (a0),d0 will perform a 16-bit addition, with one 16-bit memory operand, and one 16-bit register operand.
An add.w a0,d0 will perform a 16-bit addition, with two 16-bit register operands (lower half of a0/d0; only lower half of d0 gets updated).

The exception to this rule is operations with an address register as destination operand (add.w #1,a0 for instance). All operations against address registers are performed as 32-bit operations. The source operand must be either 16 or 32 bits. If the source operand is 16 bits, it will be sign-extended to 32 bits before the 32-bit operation takes place. All 32 bits of the address register will be used in the operation (both for reading & writing).

Examples:
An add.w #1,a0 will perform a 32-bit addition, with a 16-bit immediate operand (sign-extended to 32 bits) and a 32-bit register operand. The result is stored in a0.
An addq.w #1,a0 will perform a 32-bit addition, with a 3-bit immediate operand (zero-extended to 32 bits; this is stated in the documentation for addq) and a 32-bit register operand. The result is stored in a0.
An add.l #1,a0 will perform a 32-bit addition, with a 32-bit immediate operand and a 32-bit register operand. The result is stored in a0.
An add.w d0,a0 will perform a 32-bit addition, with a 16-bit register operand (lower half of d0) which gets sign-extended to 32 bits, and a 32-bit register operand. The result is stored in a0.

So:
If you want to update/change a pointer, and it is not kept in an address register, you want to perform a full 32-bit operation on the pointer.
If the pointer is in an address register, you can use either the .w or the .l suffix (depending on whether the source operand fits in 16 bits or needs 32 bits), since the operation itself will always be 32-bit.

If it is anything else (an index, a counter, etc etc), pick .b .w .l depending on which size you want the *operation* to be. A counter that is going to count to 10000, increasing by 1 each time, needs to be updated using a 16-bit or 32-bit addition; therefore, you will do "add.w #1 / addq.w #1 / add.l #1 / addq.l #1" to advance it.
If you would do "add.b #1" to increase the counter, its value would wrap around at 256.
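The rules above can be modelled in a few lines. The following is a sketch with made-up helper names that treats registers as plain 32-bit integers; it only models the result value, not the condition codes.

```python
# Sketch: how the size suffix affects the result of an addition.
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def add_b_dn(dn, src):
    """add.b src,dn: 8-bit operation, only bits 7-0 of dn change."""
    low = (dn + src) & 0xFF
    return (dn & 0xFFFFFF00) | low


def add_w_dn(dn, src):
    """add.w src,dn: 16-bit operation, upper half of dn is untouched."""
    low = ((dn & MASK16) + (src & MASK16)) & MASK16
    return (dn & 0xFFFF0000) | low


def adda_w(an, src):
    """add.w src,an: source is sign-extended, then a full 32-bit add."""
    return (an + sign_extend16(src)) & MASK32


if __name__ == "__main__":
    print(hex(add_w_dn(0x0001FFFF, 1)))    # carry does not reach the upper half
    print(hex(adda_w(0x0001FFFF, 1)))      # full 32-bit result
    print(hex(adda_w(0x00010000, 0xFFFF))) # $ffff is treated as -1
    counter = 0
    for _ in range(256):
        counter = add_b_dn(counter, 1)
    print(counter)                         # byte counter after 256 increments
```

The third line is the one that tends to surprise people: a 16-bit source of $ffff added to an address register moves the pointer backwards by one.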
michael phipps
Member
#19 - Posted: 11 Sep 2007 13:26
@z5_

Never try to use .b for your assembler instructions because this slows down the Amiga CPU badly. Remember, your Amiga is a 32-bit machine, so use it like one!!! Yeah, I know 68000 can be very confusing, but trust me, you'll get used to it. Give it time, DON'T RUSH!!!

@Kalms

Finally you have started to come out of your hiding place by explaining these addressing operations in full, and that's cool. I couldn't have explained it better myself, so show yourself dude LOL!!!

@StingRay

Where in the hell are you!!! Disappeared again, oh dear! ;)


Btw. How do I put up an example (executable) on here everyone???!
z5_
Member
#20 - Posted: 11 Sep 2007 13:41
@michael: please try to keep as much on topic as possible and no personal "attacks" (even if i think you mean no harm).

Btw, you can't put an executable on here. There is no point in posting tons of source code either. If you've got a well-written tutorial, you can open a thread in the code section. If you have source code which compiles and runs on a modern Amiga and puts the tutorial into practice, then even better :o) In that case, I might consider uploading some stuff on ada.

As a last thing, I was told that using .b is faster, even on modern machines. Who's right?
michael phipps
Member
#21 - Posted: 11 Sep 2007 18:59
@z5_

Sorry! I will try to behave myself, honest! And yes I have some source code available for you to contribute!

Hmmm... (.b) is faster eh?! who told you that????

P.S. Anyone see my demo yet???? am i in shape? let me know ASAP!
StingRay
Member
#22 - Posted: 12 Sep 2007 21:03 - Edited
Michael: I did not disappear. :) Just a few comments: I know perfectly well that you can skip blitter waits on A500 in certain situations. However, this is a forum where people want to learn things! It is not about the best speed-optimized code, it is about WORKING code! When you post code that only works on A500 (and don't even mention that) and people just copy/paste it into their productions, they will have a hard time figuring out why it crashes on their 68020+ machines.

Also, please refrain from posting WRONG information (like that it is not necessary to set blitter modulos for clearing); it doesn't help people who want to learn (keep this in mind). Also, move.b on 68000 doesn't slow down the CPU, at least I don't think it does. :) Anyway, it's nice that you are posting here, but please double-check what you write before you press the Post message button. ;)

As a bloody good example, I'd like to mention the code of your Proplayer replay routine: while the replay was pretty OK, you claimed it was 100% PC relative, and when I read something like that, I expect it to be true. Yet when I wanted to use it in one of my productions where I RELIED on the replay being 100% PC relative, I had quite a bad surprise (and spent quite some time actually making the replay 100% PC relative ;D). If you are interested, you can find the replay here: http://stingray.untergrund.net/ProPlayer.S

;)

Edit: Just saw Winden's post (hiya! :D) and I totally agree with everything he wrote. :)
michael phipps
Member
#23 - Posted: 13 Sep 2007 01:09
@StingRay!

You were right about the modulo settings and I was wrong again... & yeah, I did check this on an A500 too & can't get away with it! Oh dear, sorry, I think I'm becoming public enemy no. 1 LOL!!!! As I said before, I now own an A1200 machine, so I will start producing WORKING code and check that it actually works, how about that?!

As for my replayer routine: that's really cool what you've done to my code, but why did you want to make it 100% PC (program counter) relative? I thought it was okay as it was. Btw, did you use it???? Anyway, I've now printed it out coz I haven't seen everything you've done yet, but I'll give you feedback when I've finished checking it out. Well, I'm very impressed indeed.

THANKYOU ;)
TheDarkCoder
Member
#24 - Posted: 13 Sep 2007 23:08
@Kalms:

is there any difference between ADDQ.W #1,Ax and ADDQ.L #1,Ax ??
It seems to me they have the same effect. But they are different opcodes, and according to the Motorola manual, on the 68000 the former takes 4 cycles while the latter takes 8. Am I missing something?

regards
TDC
Kalms
Member
#25 - Posted: 14 Sep 2007 10:22 - Edited
@theDarkCoder:

I don't have any 68000 machine up and running right now, but I strongly suspect that it is a typo in the manual.


Looking in the online version of MC68000UM (http://www.freescale.com/files/32bit/doc/ref_manual/MC68000UM.pdf) yields the following figures:

ADD.W  <ea>,Dn    4(1/0)+
ADD.L  <ea>,Dn    6(1/0)+**
ADDA.W <ea>,An    8(1/0)+
ADDA.L <ea>,An    6(1/0)+**

ADDQ.W #imm,Dn    4(1/0)
ADDQ.L #imm,Dn    8(1/0)
ADDQ.W #imm,An    4(1/0)*
ADDQ.L #imm,An    8(1/0)

SUBQ.W #imm,Dn    4(1/0)
SUBQ.L #imm,Dn    8(1/0)
SUBQ.W #imm,An    8(1/0)*
SUBQ.L #imm,An    8(1/0)

+  means add time for EA calculation to operation
*  is not documented! (typo in the manual)
** means "if EA is #imm or Dn/An, increase base time to 8 cycles"

With this in mind, we can see that:
ADD.W Dn,Dn is 4 cycles
ADD.L Dn,Dn is 8 cycles
ADDQ.W #imm,Dn is 4 cycles
ADDQ.L #imm,Dn is 8 cycles
SUBQ.W #imm,Dn is 4 cycles
SUBQ.L #imm,Dn is 8 cycles
... that is, 32bit additions against data registers take longer than 16bit additions. This is because internally, the 68000 only has a 16-bit ALU, so any 32bit operations are performed in two passes.

ADDA.W Dn,An is 8 cycles
ADDA.L Dn,An is 8 cycles
ADDQ.W #imm,An is 4 cycles
ADDQ.L #imm,An is 8 cycles
SUBQ.W #Imm,An is 8 cycles
SUBQ.L #imm,An is 8 cycles
... all computations against address registers are done as full 32-bit computations, that's why all the xxx.W operations should take 8 cycles.

The exception here is ADDQ.W (as you have noticed). That figure could only be right if they had implemented special hardware acceleration for that particular instruction, and I strongly doubt that.


There are more typos in the 68k manual. Another example:

TST.W Dn 4(1/0)
TST.W <ea> 4(1/0)+
TST.L Dn 4(1/0)
TST.L <ea> 4(1/0)+

The numbers within parentheses are the number of word reads/writes required to execute the instruction. Notice how, according to the manual, a TST.W <ea> can be performed with just one memory access? While in reality, the instruction needs two -- one for fetching the instruction, one for fetching the operand.

A more realistic table would look like:

TST.W Dn 4(1/0)
TST.W <ea> 8(2/0)+
TST.L Dn 4(1/0) or 8(1/0)
TST.L <ea> 12(3/0)+

The reason for TST.L Dn potentially taking 8 cycles is the 16-bit ALU; the reason for TST.W <ea> and TST.L <ea> taking 8 and 12 cycles respectively is that the 68000 is designed to do at most one memory access per 4 cycles.
TheDarkCoder
Member
#26 - Posted: 16 Sep 2007 15:13
@kalms

Quote:
"I don't have any 68000 machine up and running right now, but I strongly suspect that it is a typo in the manual."

I did some tests, and it seems that you are right, it's a typo. ADDQ.W #i,Ax is as fast as ADDQ.L #i,Rx.
So there is no difference between ADDQ.W #i,Ax and ADDQ.L #i,Ax.
A curiosity: I looked at a very old book about the 68000, written by Leo J. Scanlon (the original edition is from 1981). In the appendix there are the execution timing tables, reproduced by courtesy of Motorola. In the book, the timing for ADDQ.L #i,Ax is indeed 8(1/0)! So I think that the typo was made at Motorola after that book, maybe when they created the PDF version of the manual.

Quote:
"The numbers within parentheses are the number of word reads/writes required to execute the instruction. Notice how, according to the manual, a TST.W <ea> can be performed with just one memory access? While in reality, the instruction needs two -- one for fetching the instruction, one for fetching the operand."

I think that to compute the number of memory accesses you have to add the (1/0) of the instruction to the contribution given by the <ea>, exactly as you do to compute the number of cycles. So I believe that in this case the documentation is correct. I also did some tests on TST, and they agree with the manual: TST.W Dx is as fast as TST.L Dx, and TST.W (Ay) is approximately twice as slow as TST.W Dx, as you would predict from the tables.

br
TDC
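The "base time plus <ea> time" rule can be written down as a small lookup. This is a sketch only: the table holds just the register-direct and (An) rows, with figures as I read them from the manual's effective-address timing table, so treat them as assumptions and check them against your own copy. The names are made up for illustration.

```python
# Sketch: 68000 timing as base time plus effective-address time.
# Each entry is (cycles, word reads, word writes).
# Assumed EA figures; only register direct and (An) are listed here.
EA_TIME = {
    ("Dn", "w"): (0, 0, 0),
    ("Dn", "l"): (0, 0, 0),
    ("(An)", "w"): (4, 1, 0),
    ("(An)", "l"): (8, 2, 0),
}

TST_BASE = (4, 1, 0)  # TST is listed as 4(1/0)+ for both word and long


def total_time(base, mode, size):
    """Add the EA contribution to the base time, field by field."""
    ea = EA_TIME[(mode, size)]
    return tuple(b + e for b, e in zip(base, ea))


def fmt(timing):
    """Format a timing tuple in the manual's cycles(reads/writes) notation."""
    return "%d(%d/%d)" % timing


if __name__ == "__main__":
    for mode, size in (("Dn", "w"), ("(An)", "w"), ("Dn", "l"), ("(An)", "l")):
        print("TST.%s %-4s" % (size.upper(), mode), fmt(total_time(TST_BASE, mode, size)))
```

With these figures TST.W (An) comes out at twice the cycle count of TST.W Dn, which matches the measurement described above.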
Kalms
Member
#27 - Posted: 19 Sep 2007 17:44
@tdc:

Oh. I forgot to look in the <ea> timing table. You're right about TST. :)
TheDarkCoder
Member
#28 - Posted: 20 Sep 2007 16:46
Oops... I correct myself:

Quote:
"A curiosity: I looked at a very old book about the 68000, written by Leo J. Scanlon (the original edition is from 1981). In the appendix there are the execution timing tables, reproduced by courtesy of Motorola. In the book, the timing for ADDQ.L #i,Ax is indeed 8(1/0)! So I think that the typo was made at Motorola after that book, maybe when they created the PDF version of the manual."

In the Leo Scanlon book (which reproduces Motorola's timing tables) both ADDQ.W #i,Ax and ADDQ.L #i,Ax are said to require 8(1/0), as real experiments confirm. So Motorola probably made the typo in later editions of the 68000 URM.
michael phipps
Member
#29 - Posted: 2 Oct 2007 00:53
@Stingray!

Come on m8, haven't heard from ya in awhile now! Don't you want to hear my results as i'm desperate to write something here!!!
StingRay
Member
#30 - Posted: 16 Nov 2007 12:53
Quote:
"You were right about the modulo settings and I was wrong again... & yeah, I did check this on an A500 too & can't get away with it! Oh dear, sorry, I think I'm becoming public enemy no. 1 LOL!!!! As I said before, I now own an A1200 machine, so I will start producing WORKING code and check that it actually works, how about that?!"

Sounds good. :)

Quote:
"As for my replayer routine: that's really cool what you've done to my code, but why did you want to make it 100% PC (program counter) relative? I thought it was okay as it was. Btw, did you use it???? Anyway, I've now printed it out coz I haven't seen everything you've done yet, but I'll give you feedback when I've finished checking it out. Well, I'm very impressed indeed.

THANKYOU ;)"

Well, why I made it PC relative is easy to answer: I simply HATE RELOC32 entries. Also, I used the replayer for a musicdisk and I wanted to spread it with the code so that the musicians could easily change the modules. So I simply saved the code as binary and incbin'ed it so that I a) didn't have to give away the original source and b) made it easier for people to assemble the musicdisk. So what I gave away looked something like

START:  incbin  "code.bin"

        SECTION DATA,DATA
mod1    incbin  "mod.bla1"
mod2    incbin  "mod.bla2"

Needless to say, this only works with 100% pc relative code. :)

About the fixes, the only thing worth noticing is the range check I do in the sample init part because without it there would be memory trashing with certain modules (this bug is also present in the original pt replay).