3. How do you guys handle a "simple" additive buffer blend, r = min(a + b, 255), when you need the full 0-255 range?
My old code uses branches, and I've tried various scc/subx-or constructs with both long and byte reads/writes, but none of them beats the original version. Hmm. Well, I do remember optimizing that one for the 060. Anyway, the original goes like this:
; In-place, i.e. a = min(a+b,255)
move.l (a0),d0 ; src1, dst
move.l (a1)+,d1 ; src2
add.b d0,d1
bcc.b .ok1
move.b d2,d1 ; d2 = 255
.ok1 ror.l #8,d0
ror.l #8,d1
... ; same for the other bytes
.ok4 ror.l #8,d1
move.l d1,(a0)+ ; store the four saturated bytes back to dst
Seems like the branch cache goes a long way..? My branch-free attempts are significantly slower, and it feels like you'd need an improbable mix of saturating and non-saturating a+b results to get any significant number of mispredictions..?
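For comparison, here is a minimal sketch of one branch-free formulation, written in C so it is easy to check against min(a + b, 255). It is not necessarily what any of the attempts above looked like, and the names sat_add_4x8 and blend_add_sat are made up for illustration. It handles all four bytes of a longword at once by keeping carries from crossing byte boundaries, then turning each byte's carry-out into a 0xFF mask:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: per-byte saturating add of four packed 8-bit values. */
static uint32_t sat_add_4x8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every byte; no carry can leave a byte here.
       Bit 7 of each byte in 'low' is the carry INTO bit 7. */
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);

    /* Wrapped per-byte sum: bit 7 = a7 ^ b7 ^ carry-in. */
    uint32_t sum = low ^ ((a ^ b) & 0x80808080u);

    /* Carry OUT of bit 7 per byte: majority(a7, b7, carry-in). */
    uint32_t carry = ((a & b) | ((a | b) & low)) & 0x80808080u;

    /* Expand each 0x80 carry flag to 0xFF (and 0x00 stays 0x00). */
    uint32_t mask = (carry << 1) - (carry >> 7);

    /* Bytes that overflowed are forced to 255. */
    return sum | mask;
}

/* Hypothetical in-place blend: dst = min(dst + src, 255) per byte,
   processing n_longs 32-bit words. */
static void blend_add_sat(uint32_t *dst, const uint32_t *src, size_t n_longs)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n_longs; i++)
        dst[i] = sat_add_4x8(dst[i], src[i]);
}
```

That is roughly a dozen ALU operations per longword with no branches; whether it actually beats the branchy loop on the 060 is exactly the open question, so treat it as something to benchmark rather than as an answer.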
So how does one do this in 2013? :D Surely there must be a faster way, right? I'll be using 64 and 128 color versions a lot, but still, the full 256 color blend is required in some cases. (Thanks Dalton and Blueberry for explaining the "no-overflow" techniques irl and here!)