rather than storing 0 into the high bytes of the result and adding the
high-byte product in later, write that product directly in place. this
saves a few cycles on every iteration, and it adds up nicely.
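
a rough C model of the idea, not the actual 6502 routine. it assumes an unsigned 16x16 -> 32 bit multiply built from four 8x8 partial products. the names mul8, mul16_zero_then_add and mul16_direct are made up for illustration, and the real code may handle signs and fixed-point shifts differently.

```c
#include <stdint.h>

/* Hypothetical helper: 8x8 -> 16 bit product, standing in for whatever
   8-bit multiply routine the real code uses. */
uint16_t mul8(uint8_t a, uint8_t b) {
    return (uint16_t)((uint16_t)a * (uint16_t)b);
}

/* Before (assumed shape): the high bytes start at 0 and the high-byte
   product is added into them later, like any other partial product. */
uint32_t mul16_zero_then_add(uint16_t a, uint16_t b) {
    uint8_t al = (uint8_t)(a & 0xff), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xff), bh = (uint8_t)(b >> 8);

    uint32_t r = mul8(al, bl);          /* bytes 0-1; bytes 2-3 are 0 */
    r += (uint32_t)mul8(ah, bl) << 8;   /* middle partial products */
    r += (uint32_t)mul8(al, bh) << 8;
    r += (uint32_t)mul8(ah, bh) << 16;  /* high-byte product added late */
    return r;
}

/* After: ah*bh occupies bytes 2-3 and al*bl occupies bytes 0-1, so they
   cannot overlap. Store the high product directly instead of clearing
   and adding, then add only the two middle products. */
uint32_t mul16_direct(uint16_t a, uint16_t b) {
    uint8_t al = (uint8_t)(a & 0xff), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xff), bh = (uint8_t)(b >> 8);

    uint32_t r = ((uint32_t)mul8(ah, bh) << 16) | mul8(al, bl);
    r += (uint32_t)mul8(ah, bl) << 8;
    r += (uint32_t)mul8(al, bh) << 8;
    return r;
}
```

both functions return the same value for every input. on the 6502 the second form drops the stores of 0 and one 16-bit add per multiply, which is where the per-iteration savings would come from.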
View 1 overview render times:
130XE: 10.050 ms/px - 4m56s
800XL: 10.906 ms/px - 5m21s
update_palette was missing its rts
so it fell through into keycheck
which, if the timing was wrong, would dutifully process the viewport
change and return to update_palette's caller
which in turn was -not- expecting to have to reset the outer loop
fixed