Instead of relying solely on the JMP thunks added to
imul16_func and sqr16_func, three call sites within the
Mandelbrot iteration function are patched directly to
JSR to the XE versions, saving about 15 cycles per iteration.
OK, so it's not a lot, but every second counts. ;)
XE code disabled: 1539 us/iter - 5m13s
old XE code:      1417 us/iter - 4m48s
new XE code:      1406 us/iter - 4m45s
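To put the three measurements above in proportion, here is a small sketch that turns them into relative savings against the "XE code disabled" baseline. The helper names (`to_seconds`, `percent_saved`) are made up for this illustration and are not part of the project; only the numbers come from the log.

```python
# Timings copied from the log: (microseconds per iteration, total render time).
timings = {
    "XE code disabled": (1539, "5m13s"),
    "old XE code": (1417, "4m48s"),
    "new XE code": (1406, "4m45s"),
}


def to_seconds(text):
    """Convert a string such as '5m13s' to a whole number of seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def percent_saved(before, after):
    """Relative saving of `after` compared with `before`, in percent."""
    return 100.0 * (before - after) / before


baseline_us, baseline_time = timings["XE code disabled"]
for name, (us, total) in timings.items():
    per_iter = percent_saved(baseline_us, us)
    overall = percent_saved(to_seconds(baseline_time), to_seconds(total))
    print(f"{name}: {per_iter:.1f}% per iter, {overall:.1f}% total")
```

The step from the old to the new XE code is 11 us/iter, which is small next to the gain from enabling the XE code at all, but it still shows up in the total render time.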
For some reason rounding is giving me wrong results;
not sure what I'm doing wrong. :D
Just show 6 digits instead. :P
OK, this gets the us/iter readout working, and it is more stable,
but the elapsed time still needs to be added.
It does this weird thing where it sometimes reads out wrong digits
and then switches to the expected unit of sec/px.
Work in progress; no clue what's going on.
This frees up 12 bytes of zero page space and costs no measurable
time, as these variables are not in the hot path and the timing was
only a tiny bit different.
Rather than saving 0 into the high bytes, then adding the high-byte
multiplication later, write it directly in place. This saves a few
cycles on every iteration, and it adds up nicely.
View 1 overview render times:
130XE: 10.050 ms/px - 4m56s
800XL: 10.906 ms/px - 5m21s
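The "write it directly in place" change can be sketched as follows. This assumes the multiply builds a 32-bit result from four 8x8 partial products and uses unsigned operands for simplicity; the function names are hypothetical and the real routine is 6502 assembly, so treat this as a model of the idea rather than the actual code.

```python
def split_bytes(value):
    """Split a 16-bit value into (low byte, high byte)."""
    return value & 0xFF, (value >> 8) & 0xFF


def mul16_zero_then_add(a, b):
    """Older approach: clear the high bytes, add the hi*hi product later."""
    a_lo, a_hi = split_bytes(a)
    b_lo, b_hi = split_bytes(b)
    result = a_lo * b_lo            # fills bytes 0-1; bytes 2-3 start as 0
    result += (a_lo * b_hi) << 8    # cross products land in bytes 1-3
    result += (a_hi * b_lo) << 8
    result += (a_hi * b_hi) << 16   # extra 16-bit add into bytes 2-3
    return result & 0xFFFFFFFF


def mul16_write_in_place(a, b):
    """Newer approach: store the hi*hi product straight into bytes 2-3."""
    a_lo, a_hi = split_bytes(a)
    b_lo, b_hi = split_bytes(b)
    # lo*lo fits in 16 bits, so the two products cannot overlap:
    # both are plain stores, and the zeroing plus one add disappear.
    result = (a_lo * b_lo) | ((a_hi * b_hi) << 16)
    result += (a_lo * b_hi) << 8
    result += (a_hi * b_lo) << 8
    return result & 0xFFFFFFFF


for a, b in [(0, 0), (1, 1), (0x1234, 0xABCD), (0xFFFF, 0xFFFF)]:
    assert mul16_zero_then_add(a, b) == mul16_write_in_place(a, b) == a * b
```

Both versions produce the same product; the second simply avoids initializing the high bytes to zero and then adding into them.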
Iterate in steps of fill_masks[fill_level]+1 instead of visiting
every pixel and then skipping, which saves a smidge of time.
view 1 with expanded memory:
10.514 ms/px before
10.430 ms/px after
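A rough sketch of the loop change described above, assuming each fill_masks entry is a power-of-two step minus one and that the old loop skipped pixels with a bitwise test against the mask. The table contents and function names here are invented for illustration; the real loop may also skip pixels already drawn at a coarser level, which is not modeled.

```python
# Hypothetical mask table: each entry is (step - 1) for a power-of-two step.
fill_masks = [7, 3, 1, 0]


def columns_by_skipping(width, fill_level):
    """Older loop shape: visit every pixel, skip those off the grid."""
    mask = fill_masks[fill_level]
    visited = []
    for x in range(width):
        if x & mask:        # not on this level's grid, so skip it
            continue
        visited.append(x)
    return visited


def columns_by_stepping(width, fill_level):
    """Newer loop shape: advance by mask + 1 and never test skipped pixels."""
    step = fill_masks[fill_level] + 1
    return list(range(0, width, step))


for level in range(len(fill_masks)):
    assert columns_by_skipping(160, level) == columns_by_stepping(160, level)
```

The pixels visited are identical; the saving comes from not running the loop overhead and the skip test for pixels that would be discarded anyway.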