A.D.A. Amiga Demoscene Archive

Amiga Demoscene Archive Forum / Coding / Parallel byte-wise maximum

Author	Message
Blueberry Member	#1 - Posted: 25 Feb 2013 10:04 - Edited Reply Quote Hi! It has been discussed in here a couple of times how to do byte-wise saturated add one longword at a time. But how about byte-wise maximum? If we have byte values up to 127 in all four bytes of D0 and D1, we can perform a byte-wise maximum like this: move.l #$80808080,d2 sub.l d1,d0 ; byte-wise differences with overflows add.l d2,d0 ; overflows negated, msb of each byte is 1 iff difference is positive and.l d0,d2 ; 128 for bytes where difference is positive, 0 for the others move.l d2,d3 lsr.l #7,d3 sub.l d3,d2 ; 127 for bytes where difference is positive, 0 for the others and.l d2,d0 ; only positive differences add.l d1,d0 ; original + positive differences = maximum With appropriate pairing, these 9 instructions result in an overhead (compared to simply using one of the sources) of 1.125 cycles per byte. One of the sources (D1) can be in memory for no extra cost. Can it be done better? Can we get down to 1 cycle? :) Suppose we need, say, 6 bits of precision on our byte values, but those 6 bits have to be placed in the upper bits of each byte (in the memory operand and the destination). We can do this by a move (from memory) and an lsr #1 at the beginning and an lsl #1 at the end, resulting in 3 more instructions (1.5 cycles per byte total). Is there perhaps some better way to do it?
jamie2010 Member	#2 - Posted: 25 Feb 2013 14:14 Reply Quote Hey, Do you need to do this sort of operation several times by pixel?
jamie2010 Member	#3 - Posted: 25 Feb 2013 14:17 Reply Quote By Pixels i mean for 4 pixels.
dalton Member	#4 - Posted: 26 Feb 2013 08:43 Reply Quote I can't think of a quicker way to perform this operation. What bothers me is that so many cycles are spent on creating a bit mask from the sign bits (x-x>>7). Another way of doing it would be (x>>7)*127. It saves one line of code and one register. But I don't think it's any faster =( move.l #$80808080,d2 sub.l d1,d0 ; byte-wise differences with overflows add.l d2,d0 ; overflows negated, msb of each byte is 1 iff difference is positive and.l d0,d2 ; 128 for bytes where difference is positive, 0 for the others lsr.l #7,d2 ; 1 for bytes where difference is positive, 0 for the others mulu.l #127,d2 ; 127 for bytes where difference is positive, 0 for the others and.l d2,d0 ; only positive differences add.l d1,d0 ; original + positive differences = maximum
Blueberry Member	#5 - Posted: 26 Feb 2013 11:22 Reply Quote That's right. The mulu instruction takes 2 cycles and does not pair. Thus, it is equivalent to 4 pairable instructions. One way to save one instruction in some cases would be to generate the incoming pixel values (D0 above) between 128 and 255 instead of between 0 and 127. Then we would not need the first add. And in many cases this could be done without any extra cost (if the values come from a lookup table, for instance). jamie: Usually it would be done several times per pixel, yes, but in an unpredictable manner. Typically, it would be small pieces of graphics drawn scattered all over the screen at some computed positions. Particles, essentially. Another issue is, if this comes from small pieces of graphics drawn at arbitrary positions, the source will often be unaligned with the destination. Aligning the source gives a one-cycle penalty for both the read and the write of the destination. Aligning the destination only gives the penalty for the read of the source, but could result in more complex per-scanline code.
jamie2010 Member	#6 - Posted: 26 Feb 2013 13:22 Reply Quote Maybe you can use a 16 bit buffer and clamp-reorganize the buffer in the c2p.
Blueberry Member	#7 - Posted: 26 Feb 2013 17:13 Reply Quote Clamping in postprocessing works for saturated add (since clamping once at the end is equivalent to clamping every time if you have enough overflow headroom), but how would you do it for maximum? You can't compute the maximum of a series of values from the sum of the values. Also, if the buffer is 16-bit, we can only work on two values at a time instead of four. I don't think that pays off.
jamie2010 Member	#8 - Posted: 26 Feb 2013 18:23 Reply Quote For particles: _ Use one tile (32*32) for rendering _ Render all "visible" particles with no clamping in this tile using 16 bits buffer _ Clamp and transfer this tile from fast to chip memory Not sure what king of operations you use for your particles.
jamie2010 Member	#9 - Posted: 26 Feb 2013 18:44 Reply Quote If you want the maximum it's another story:)
Raylight Member	#10 - Posted: 12 Mar 2013 20:31 - Edited Reply Quote Hmm I spent some time on this one and somewhere there my HD died.. argh! Anyway, the code you posted seems pretty optimal. If it would be possible to get down to 8 instructions I believe a different (high-level) approach needs to be "invented" and optimized from there. Or reformat the input values as you mentioned.. I didn't get very far on the 6-bit problem, but any particular reason for the need to store in the upper bits (\|543210xx\|) per byte? Would \|x543210x\| be of any use?
Blueberry Member	#11 - Posted: 13 Mar 2013 10:13 Reply Quote It has to do with the way my palette is arranged, for compatibility with some other effects. The method I came up with so far is: - Render one scanline at a time in a cached buffer, using the lower 6 bits of each byte - When a scanline is done, shift it up and do some other postprocessing. - Write out the scanline. As for your HD: use a CompactFlash card in an IDE-to-CF adapter. Both are dirt cheap, and it's a both fast and reliable* solution. * Until you accidentally connect the power cable reversed...

A.D.A. Amiga Demoscene Archive, Version 3.0