things to try:
* skip add on the top-byte multiply in sqr8/mul8
  * should save a few cycles, suggestion by jamey
* patch the entire expanded-ram imul8xe on top of imul8 to avoid the 3-cycle thunk penalty :D
* try 3.13 fixed point instead of 4.12 for more precision
  * can we get away without the extra bit?
  * since exit compare space would be 6.26 i think so
* y-axis mirror optimization
* try filling in the extra scanlines on 4x4 and 2x2 tiering
* extract viewport for display & re-input via keyboard
* fujinet screenshot/viewport uploader
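
a rough sketch of the 3.13 vs 4.12 question above, in python rather than 6502 asm so it's easy to poke at. it assumes signed 16-bit values and a 16x16 -> 32-bit multiply, and assumes the escape test is done on the full-width products (8.24 for 4.12 inputs, 6.26 for 3.13 inputs). `to_fixed`, `from_fixed` and `escaped` are made-up names for illustration, not routines from the actual source.

```python
# Sketch only: helper names are hypothetical, not taken from the project source.

def to_fixed(x, frac_bits):
    """Convert a float to a signed 16-bit fixed-point integer."""
    raw = round(x * (1 << frac_bits))
    if not -0x8000 <= raw <= 0x7FFF:
        raise OverflowError(f"{x} does not fit with {frac_bits} fractional bits")
    return raw

def from_fixed(raw, frac_bits):
    """Convert a raw fixed-point integer back to a float."""
    return raw / (1 << frac_bits)

def escaped(zx, zy, frac_bits):
    """Escape test |z|^2 >= 4, done on the full-width products."""
    # Each product carries 2 * frac_bits fractional bits.
    # Python ints never overflow; a real 32-bit implementation
    # would have to check that the sum still fits.
    norm = zx * zx + zy * zy
    threshold = 4 << (2 * frac_bits)
    return norm >= threshold

# Compare what each 16-bit format can represent.
for frac_bits in (12, 13):
    int_bits = 16 - frac_bits
    limit = 1 << (int_bits - 1)  # one of the integer bits is the sign
    print(f"{int_bits}.{frac_bits}: range [-{limit}, {limit}), step {2.0 ** -frac_bits}")
```

this only shows that the compare itself has headroom: 4.0 fits comfortably in 6.26. it doesn't settle whether the updated z, once narrowed back to 16 bits, always stays inside [-4, 4), which is the other half of the "extra bit" question.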