It has been discussed here a couple of times how to do a byte-wise saturated add one longword at a time.
But how about byte-wise maximum? If we have byte values up to 127 in all four bytes of D0 and D1, we can perform a byte-wise maximum like this:
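move.l #$80808080,d2 ; msb of every byte (assumed setup; d2 is consumed below, so it is reloaded each time)
sub.l d1,d0 ; byte-wise differences, with borrows between bytes
add.l d2,d0 ; borrows cancelled; msb of each byte is 1 iff the difference is non-negative
and.l d0,d2 ; 128 for bytes where the difference is non-negative, 0 for the others
move.l d2,d3 ; (assumed) copy the flag bits...
lsr.l #7,d3 ; (assumed) ...down to bit 0 of each byte: 1 or 0
sub.l d3,d2 ; 127 for bytes where the difference is non-negative, 0 for the others
and.l d2,d0 ; keep only the non-negative differences
add.l d1,d0 ; original + non-negative differences = maximum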
With appropriate pairing, these 9 instructions take 4.5 cycles per longword, which is an overhead (compared to simply using one of the sources) of 1.125 cycles per byte. One of the sources (D1) can be in memory at no extra cost.
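For anyone who wants to check the arithmetic without an assembler at hand, here is a rough C model of the register operations. It is only a sketch: the names `bytemax7`, `bytemax_ref` and `bytemax7_selfcheck` are made up for illustration, and it assumes that d2 starts out as $80808080 and that d3 is d2 shifted right by 7, which is what the comments on the listing imply.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise maximum of two packed longwords, assuming every byte is <= 127.
   Variable names mirror the registers in the listing. */
static uint32_t bytemax7(uint32_t d0, uint32_t d1)
{
    uint32_t d2 = 0x80808080u;  /* assumed initial value of d2 */
    uint32_t d3;

    d0 -= d1;       /* sub.l d1,d0 : differences, borrows may cross bytes */
    d0 += d2;       /* add.l d2,d0 : borrows cancelled, msb set iff d0 byte >= d1 byte */
    d2 &= d0;       /* and.l d0,d2 : 0x80 in bytes where d0 byte >= d1 byte */
    d3 = d2 >> 7;   /* assumed step: 0x01 in those same bytes */
    d2 -= d3;       /* sub.l d3,d2 : 0x7F mask in those bytes */
    d0 &= d2;       /* and.l d2,d0 : keep the non-negative differences */
    d0 += d1;       /* add.l d1,d0 : d1 + difference = maximum */
    return d0;
}

/* Plain per-byte maximum, used only as a reference. */
static uint32_t bytemax_ref(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i += 8) {
        uint32_t x = (a >> i) & 0xFFu;
        uint32_t y = (b >> i) & 0xFFu;
        r |= (x > y ? x : y) << i;
    }
    return r;
}

/* Compare both versions for all byte pairs 0..127, spread over the lanes. */
static void bytemax7_selfcheck(void)
{
    for (uint32_t a = 0; a < 128; a++) {
        for (uint32_t b = 0; b < 128; b++) {
            uint32_t x = a | (b << 8) | ((127u - a) << 16) | (a << 24);
            uint32_t y = b | (a << 8) | (b << 16) | ((127u - b) << 24);
            assert(bytemax7(x, y) == bytemax_ref(x, y));
        }
    }
}
```

As a spot check, `bytemax7(0x7F00102A, 0x007F1029)` gives `0x7F7F102A`. The model also shows why the inputs must stay at or below 127: the msb of each byte is borrowed as a guard bit during the subtraction.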
Can it be done better? Can we get down to 1 cycle? :)
Suppose we need, say, 6 bits of precision on our byte values, but those 6 bits have to be placed in the upper bits of each byte (in the memory operand and the destination). We can do this with a move (from memory) and an lsr.l #1 at the beginning and an lsl.l #1 at the end, resulting in 3 more instructions (12 in all, 1.5 cycles per byte total). Is there perhaps some better way to do it?
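To make the shifted variant concrete, here is a small C sketch of it. The names `max_low7` and `max_upper_bits` are hypothetical, and I am assuming that the running value (D0) is already kept in the low 7 bits of each byte and that bit 0 of every byte of the memory operand is clear, so the one-bit shift does not leak into the neighbouring byte.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise maximum for bytes <= 127, same steps as the listing
   (assumes d2 = $80808080 and the 0x7F mask is built as d2 - (d2 >> 7)). */
static uint32_t max_low7(uint32_t d0, uint32_t d1)
{
    uint32_t d2 = 0x80808080u;
    d0 = d0 - d1 + d2;      /* msb of each byte set iff d0 byte >= d1 byte */
    d2 &= d0;               /* 0x80 flags */
    d2 -= d2 >> 7;          /* 0x7F masks */
    return (d0 & d2) + d1;  /* d1 + non-negative differences */
}

/* acc7: values in the low 7 bits of each byte (assumed format of D0).
   mem:  values in the upper bits of each byte, bit 0 assumed clear.
   Returns the maximum in the upper-bit format. */
static uint32_t max_upper_bits(uint32_t acc7, uint32_t mem)
{
    uint32_t d1 = mem >> 1;             /* move.l from memory, then lsr.l #1 */
    uint32_t d0 = max_low7(acc7, d1);   /* the nine-instruction core */
    return d0 << 1;                     /* lsl.l #1 at the end */
}
```

For example, `max_upper_bits(0x10203040, 0xFC008044)` gives `0xFC408080`. The final shift is safe because every result byte is at most 127, so nothing is pushed into the byte above.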