things to try:

* skip add on the top-byte multiply in sqr8/mul8
  * should save a few cycles (suggested by jamey)

* perform the zx_next = zx^2 + cx in 32-bit space, before rounding
  * should improve precision on max zoom, might cost a few cycles
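
A rough Python model of what this might buy, not the actual 6502 routine. It assumes 4.12 values held as plain integers and the usual Mandelbrot real-part update zx^2 - zy^2 + cx. The helper names (`round_shift`, `next_zx_round_early`, `next_zx_round_late`) are made up for illustration.

```python
FRAC_BITS = 12  # assumed 4.12 fixed point: 12 fractional bits


def round_shift(value, bits):
    """Drop `bits` low bits, rounding to nearest (ties round up)."""
    return (value + (1 << (bits - 1))) >> bits


def next_zx_round_early(zx, zy, cx):
    """Round each square back to 4.12 first, then add in 16-bit space."""
    zx2 = round_shift(zx * zx, FRAC_BITS)
    zy2 = round_shift(zy * zy, FRAC_BITS)
    return zx2 - zy2 + cx


def next_zx_round_late(zx, zy, cx):
    """Keep the full-width products, add cx at the same scale, round once."""
    wide = zx * zx - zy * zy + (cx << FRAC_BITS)
    return round_shift(wide, FRAC_BITS)


# Tiny inputs where the two orderings disagree by one unit in the last place:
# the exact result is 91/4096 of a unit, so 0 is the closer answer.
early = next_zx_round_early(46, 45, 0)  # 1
late = next_zx_round_late(46, 45, 0)    # 0
```

Rounding once keeps the error within half a unit in the last place, while rounding each square separately can be off by up to a full unit. That matters most when neighbouring pixels differ by only a few units, as at max zoom.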

* patch the entire expanded-ram imul8xe on top of imul8 to avoid the 3-cycle thunk penalty :D

* try 3.13 fixed point instead of 4.12 for more precision
  * can we get away without the extra bit?
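
A quick way to look at the trade-off, as a sketch only. It assumes "4.12" and "3.13" mean signed 16-bit values whose integer-bit count includes the sign bit. `fixed_format` and `fits` are hypothetical helpers, not names from the code.

```python
def fixed_format(int_bits, frac_bits):
    """Return (lowest, highest, step) for a signed fixed-point format.

    int_bits is assumed to include the sign bit, e.g. 4.12 or 3.13
    packed into a 16-bit word.
    """
    scale = 1 << frac_bits
    half = 1 << (int_bits + frac_bits - 1)
    return -half / scale, (half - 1) / scale, 1 / scale


def fits(value, int_bits, frac_bits):
    """True if `value` lies inside the representable range of the format."""
    lowest, highest, _ = fixed_format(int_bits, frac_bits)
    return lowest <= value <= highest


# 4.12 covers [-8, 8) in steps of 1/4096.
# 3.13 covers [-4, 4) in steps of 1/8192.
range_4_12 = fixed_format(4, 12)
range_3_13 = fixed_format(3, 13)

# The escape threshold |z|^2 = 4 is representable in 4.12 but not in 3.13.
threshold_ok_4_12 = fits(4.0, 4, 12)  # True
threshold_ok_3_13 = fits(4.0, 3, 13)  # False
```

So 3.13 halves the step size, but any intermediate that can reach 4.0 or more (the escape comparison, or zx^2 + cx just before bailout) would need to be kept wider, saturated, or tested before it is stored.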

* y-axis mirror optimization
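
The set is symmetric about the real axis, so the iteration count for (cx, cy) equals the count for (cx, -cy). Below is a floating-point sketch of using that, assuming the viewport is vertically centred on cy = 0. `mandel_iters` and `render_mirrored` are illustrative names only; the real renderer works in fixed point.

```python
def mandel_iters(cx, cy, max_iter=32):
    """Iterations before |z|^2 exceeds 4, or max_iter if it never does."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2 * zx * zy + cy
    return max_iter


def render_mirrored(width, height, x0, x1, y_extent, max_iter=32):
    """Render rows spanning cy in [-y_extent, +y_extent].

    Only the top half (plus the middle row when height is odd) is
    computed; each row is copied to its mirror position.
    Assumes width >= 2 and height >= 2.
    """
    rows = [None] * height
    for row in range((height + 1) // 2):
        cy = -y_extent + 2 * y_extent * row / (height - 1)
        line = [
            mandel_iters(x0 + (x1 - x0) * col / (width - 1), cy, max_iter)
            for col in range(width)
        ]
        rows[row] = line
        rows[height - 1 - row] = list(line)  # mirrored copy
    return rows


image = render_mirrored(5, 5, -2.0, 1.0, 1.0)
```

This roughly halves the work, but only while the view straddles the real axis evenly. Once the viewport is panned off-centre, only the overlapping band of rows could be mirrored.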

* 'wide pixels' 2x and 4x for a fuller initial image in the tiered rendering
  * maybe redo tiering to just 4x4, 2x2, 1x1?
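
One possible shape for a 4x4, 2x2, 1x1 scheme, sketched in Python. The generator name `tiered_passes` and the skip rule are assumptions, not how the current tiering works. Each pass computes only the points the coarser pass has not already done, and the caller would fill a `size` x `size` block with the result.

```python
def tiered_passes(width, height, steps=(4, 2, 1)):
    """Yield (x, y, size) for every point to compute, coarsest pass first.

    A point already computed by the previous, coarser pass is skipped,
    so each pixel is computed exactly once overall.
    """
    previous = None
    for step in steps:
        for y in range(0, height, step):
            for x in range(0, width, step):
                if previous is not None and x % previous == 0 and y % previous == 0:
                    continue  # already computed in the coarser pass
                yield x, y, step
        previous = step


# For an 8x8 area: 4 points drawn as 4x4 blocks, then 12 as 2x2, then 48 as 1x1.
points = list(tiered_passes(8, 8))
```

After the first pass the whole area is covered by coarse blocks, and the later passes only refine it, so no iteration count is computed twice.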

* extract viewport for display & re-input via keyboard

* fujinet screenshot/viewport uploader