round16_incdec uses inc and dec;
round16_addsub uses adc and sbc.
the incdec version takes the same number of cycles when no rounding is needed,
but saves about 8 cycles in the rounding cases, for an
average saving of about 4.5 cycles on randomly distributed inputs.
untested so far.
+4 cycles one-time setup
ldx #8 ; 2 cyc for first 8
ldx #8 ; 2 cyc for second 8 (different shift behavior)
-5 cycles/iter to get bit now
lsr arg + 1 ; 5 cyc
rol arg ; 5 cyc
+10 cycles/iter to get the bit in a loop
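dex ; 2 cyc
bne ; 2 cyc not taken, 3 cyc taken (NMOS 6502 timing, no page crossing)
4 cycles/iter for the loop, counting bne at its 2-cycle minimum
4 + (14 * 16) = 4 + 224 = 228
if the 16 iterations run as two passes of 8, 14 of the branches are taken,
so charging 3 cycles for each of those gives 228 + 14 = 242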
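
The tally above can be sketched as a small Python model, to make it easy to re-run the arithmetic when an instruction changes. This is only an illustration: the constant and function names are hypothetical, the per-instruction cycle counts are the ones quoted in these notes, and it assumes zero-page operands, two passes of 8 iterations, and NMOS 6502 branch timing with no page crossing.

```python
# Cycle tally for the 16-iteration shift loop (assumed: two passes of 8).
# All names here are hypothetical and exist only for this illustration.

SETUP = 2 + 2          # two "ldx #8" instructions, 2 cycles each
GET_BIT = 5 + 5        # lsr arg + 1, then rol arg (zero-page operands assumed)
DEX = 2
BNE_NOT_TAKEN = 2
BNE_TAKEN = 3          # assumed NMOS 6502 timing, branch within the same page
ITERS_PER_PASS = 8
PASSES = 2


def estimate_flat():
    """Estimate as written in the notes: every bne is charged 2 cycles."""
    per_iter = GET_BIT + DEX + BNE_NOT_TAKEN
    return SETUP + per_iter * ITERS_PER_PASS * PASSES


def estimate_taken_aware():
    """Same loop, charging 3 cycles for each taken bne."""
    taken_iter = GET_BIT + DEX + BNE_TAKEN
    last_iter = GET_BIT + DEX + BNE_NOT_TAKEN   # final bne falls through
    per_pass = (ITERS_PER_PASS - 1) * taken_iter + last_iter
    return SETUP + per_pass * PASSES


print(estimate_flat())         # 228
print(estimate_taken_aware())  # 242
```

The first figure reproduces the 228 in the notes; the second is what the same loop would cost if taken branches are counted at 3 cycles, which may or may not match the real code layout.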