A.D.A. Amiga Demoscene Archive

  Welcome guest! Please register a new account or log in

  

  

  

Demos Amiga Demoscene Archive Forum / Coding / Parallel byte-wise maximum

 

Author Message
Blueberry
Member
#1 - Posted: 25 Feb 2013 10:04 - Edited
Reply Quote
Hi!

It has been discussed in here a couple of times how to do byte-wise saturated add one longword at a time.

But how about byte-wise maximum? If we have byte values up to 127 in all four bytes of D0 and D1, we can perform a byte-wise maximum like this:

    move.l  #$80808080,d2
sub.l d1,d0 ; byte-wise differences with overflows
add.l d2,d0 ; overflows negated, msb of each byte is 1 iff difference is positive
and.l d0,d2 ; 128 for bytes where difference is positive, 0 for the others
move.l d2,d3
lsr.l #7,d3
sub.l d3,d2 ; 127 for bytes where difference is positive, 0 for the others
and.l d2,d0 ; only positive differences
add.l d1,d0 ; original + positive differences = maximum


With appropriate pairing, these 9 instructions result in an overhead (compared to simply using one of the sources) of 1.125 cycles per byte. One of the sources (D1) can be in memory for no extra cost.

Can it be done better? Can we get down to 1 cycle? :)

Suppose we need, say, 6 bits of precision on our byte values, but those 6 bits have to be placed in the upper bits of each byte (in the memory operand and the destination). We can do this by a move (from memory) and an lsr #1 at the beginning and an lsl #1 at the end, resulting in 3 more instructions (1.5 cycles per byte total). Is there perhaps some better way to do it?
jamie2010
Member
#2 - Posted: 25 Feb 2013 14:14
Reply Quote
Hey,

Do you need to do this sort of operation several times by pixel?
jamie2010
Member
#3 - Posted: 25 Feb 2013 14:17
Reply Quote
By Pixels i mean for 4 pixels.
dalton
Member
#4 - Posted: 26 Feb 2013 08:43
Reply Quote
I can't think of a quicker way to perform this operation. What bothers me is that so many cycles are spent on creating a bit mask from the sign bits (x-x>>7). Another way of doing it would be (x>>7)*127. It saves one line of code and one register. But I don't think it's any faster =(


move.l #$80808080,d2
sub.l d1,d0 ; byte-wise differences with overflows
add.l d2,d0 ; overflows negated, msb of each byte is 1 iff difference is positive
and.l d0,d2 ; 128 for bytes where difference is positive, 0 for the others
lsr.l #7,d2 ; 1 for bytes where difference is positive, 0 for the others
mulu.l #127,d2 ; 127 for bytes where difference is positive, 0 for the others
and.l d2,d0 ; only positive differences
add.l d1,d0 ; original + positive differences = maximum
Blueberry
Member
#5 - Posted: 26 Feb 2013 11:22
Reply Quote
That's right. The mulu instruction takes 2 cycles and does not pair. Thus, it is equivalent to 4 pairable instructions.

One way to save one instruction in some cases would be to generate the incoming pixel values (D0 above) between 128 and 255 instead of between 0 and 127. Then we would not need the first add. And in many cases this could be done without any extra cost (if the values come from a lookup table, for instance).

jamie: Usually it would be done several times per pixel, yes, but in an unpredictable manner. Typically, it would be small pieces of graphics drawn scattered all over the screen at some computed positions. Particles, essentially.

Another issue is, if this comes from small pieces of graphics drawn at arbitrary positions, the source will often be unaligned with the destination. Aligning the source gives a one-cycle penalty for both the read and the write of the destination. Aligning the destination only gives the penalty for the read of the source, but could result in more complex per-scanline code.
jamie2010
Member
#6 - Posted: 26 Feb 2013 13:22
Reply Quote
Maybe you can use a 16 bit buffer and clamp-reorganize the buffer in the c2p.
Blueberry
Member
#7 - Posted: 26 Feb 2013 17:13
Reply Quote
Clamping in postprocessing works for saturated add (since clamping once at the end is equivalent to clamping every time if you have enough overflow headroom), but how would you do it for maximum? You can't compute the maximum of a series of values from the sum of the values.

Also, if the buffer is 16-bit, we can only work on two values at a time instead of four. I don't think that pays off.
jamie2010
Member
#8 - Posted: 26 Feb 2013 18:23
Reply Quote
For particles:
_ Use one tile (32*32) for rendering
_ Render all "visible" particles with no clamping in this tile using 16 bits buffer
_ Clamp and transfer this tile from fast to chip memory

Not sure what king of operations you use for your particles.
jamie2010
Member
#9 - Posted: 26 Feb 2013 18:44
Reply Quote
If you want the maximum it's another story:)
Raylight
Member
#10 - Posted: 12 Mar 2013 20:31 - Edited
Reply Quote
Hmm I spent some time on this one and somewhere there my HD died.. argh! Anyway, the code you posted seems pretty optimal. *If* it would be possible to get down to 8 instructions I believe a different (high-level) approach needs to be "invented" and optimized from there. Or reformat the input values as you mentioned..

I didn't get very far on the 6-bit problem, but any particular reason for the need to store in the upper bits (|543210xx|) per byte? Would |x543210x| be of any use?
Blueberry
Member
#11 - Posted: 13 Mar 2013 10:13
Reply Quote
It has to do with the way my palette is arranged, for compatibility with some other effects.

The method I came up with so far is:

- Render one scanline at a time in a cached buffer, using the lower 6 bits of each byte
- When a scanline is done, shift it up and do some other postprocessing.
- Write out the scanline.


As for your HD: use a CompactFlash card in an IDE-to-CF adapter. Both are dirt cheap, and it's a both fast and reliable* solution.

* Until you accidentally connect the power cable reversed...

 

  Please register a new account or log in to comment

  

  

  

 

A.D.A. Amiga Demoscene Archive, Version 3.0