Compare commits

4 commits

Author SHA1 Message Date
3553ce986f shave some cycles off 16-bit squaring with shift instead of add
also fix the comments about how many cycles shift takes
2024-12-31 15:29:40 -08:00
0f49760aa5 unify tables for squaring and multiplication 2024-12-31 02:26:24 -08:00
f06aed0c00 set results from both 8-bit squares first
Since the results from the lo and hi squares don't overlap or overflow,
they can be written directly to the final output location without doing
any addition. Then only the multiplication that goes in the middle needs
any adds.
2024-12-31 02:22:31 -08:00
aee587388d eliminate mul_hibyte512 table
This costs an extra half cycle on average, assuming uniform distribution
of multiplication inputs. I don't think a half cycle is worth an extra
256-byte table.
2024-12-31 02:01:45 -08:00
4 changed files with 263 additions and 637 deletions
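
The observation in commit f06aed0c00 can be sketched in JavaScript, the language of the table generator. This is only an illustration of the arithmetic, not the assembly in `mandel.s` (whose diff is suppressed here); the function name `square16` and the assumption of a non-negative input in the range 0..65535 are hypothetical.

```javascript
// Square a 16-bit value from its two bytes:
//   x^2 = hi^2 * 65536 + 2 * hi * lo * 256 + lo^2
// Assumption: x is already non-negative (0..65535).
function square16(x) {
  const lo = x & 0xff;
  const hi = (x >> 8) & 0xff;

  // lo*lo is at most 16 bits and lands in bytes 0..1;
  // hi*hi is at most 16 bits and lands in bytes 2..3.
  // The two never overlap, so they can be stored without any carries.
  let result = lo * lo + hi * hi * 0x10000;

  // Only the cross term, placed one byte up, has to be added in.
  // Plain multiplication is used instead of << to stay clear of
  // JavaScript's 32-bit signed shift semantics.
  result += 2 * hi * lo * 0x100;

  return result;
}
```

The doubling of the cross term is where a shift can stand in for an add, which is the idea the later commit 3553ce986f describes.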

847
mandel.s

File diff suppressed because it is too large


@ -18,7 +18,7 @@ Enjoy! I'll probably work on this off and on for the next few weeks until I've g
## Current state
Basic rendering is functional, with interactive zoom/pan (+/-/arrows) and 6 preset viewports via the number keys.
Basic rendering is functional, with interactive zoom/pan (+/-/arrows) and 4 preset viewports via the number keys.
The 16-bit signed integer multiplication takes two 16-bit inputs and emits one 32-bit output in the zero page, using the Atari OS ROM's floating point registers as workspaces. Inputs are clobbered.
@ -27,7 +27,7 @@ The 16-bit signed integer multiplication takes two 16-bit inputs and emits one 3
* when expanded RAM is available as on 130XE, a 64KB 8-bit multiplication table accelerates the remaining multiplications
* without expanded RAM, a table of half-squares is used to implement the algorithm from https://everything2.com/title/Fast+6502+multiplication
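
As a rough illustration of how a half-squares table can stand in for a multiply, here is a sketch built on the identity `a*b = ((a+b)^2 - a^2 - b^2) / 2`. It is not taken from the linked article or from `mandel.s`; the names `halfSquares` and `mul8` are hypothetical, and the rounding correction for two odd inputs is one way to make a floored table exact, which the real routine may handle differently.

```javascript
// Table of floor(i*i / 2) for i in 0..511, enough to cover a + b
// when a and b are both 8-bit values.
const halfSquares = [];
for (let i = 0; i < 512; i++) {
  halfSquares.push((i * i) >> 1);
}

// Multiply two unsigned 8-bit values with three table lookups.
function mul8(a, b) {
  const raw = halfSquares[a + b] - halfSquares[a] - halfSquares[b];
  // When a and b are both odd, flooring drops half a unit from each of
  // the two subtracted entries, so the raw result is one too large.
  return raw - (a & b & 1);
}
```

For example, `mul8(3, 5)` looks up 32, 4 and 12, giving 16, and the odd-odd correction brings that to 15.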
The mandelbrot calculations are done using 3.13-precision fixed point numbers with 6.26-precision intermediates.
The mandelbrot calculations are done using 4.12-precision fixed point numbers. It may be possible to squish this down to 3.13.
Iterations are capped at 255.
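
To make the fixed point notation concrete, here is a small sketch of a 4.12 multiply. The helper names `toFix`, `fromFix` and `mulFix` are hypothetical and the code only models the arithmetic, not the 6502 routine. Passing 13 as `fracBits` gives the 3.13 variant.

```javascript
// 4.12 fixed point: 16-bit signed values with 12 fractional bits.
const FRAC_BITS = 12;

// Convert a real number to fixed point by scaling and rounding.
function toFix(x, fracBits = FRAC_BITS) {
  return Math.round(x * (1 << fracBits));
}

// Convert a fixed point value back to a real number.
function fromFix(n, fracBits = FRAC_BITS) {
  return n / (1 << fracBits);
}

// The full product of two 16-bit inputs is a 32-bit value with twice
// as many fractional bits (8.24 for 4.12 inputs, 6.26 for 3.13 inputs).
// Shifting right by fracBits brings it back to the input format.
function mulFix(a, b, fracBits = FRAC_BITS) {
  return (a * b) >> fracBits;
}
```

For instance, `fromFix(mulFix(toFix(1.5), toFix(-0.75)))` evaluates to `-1.125`.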
@ -47,4 +47,4 @@ Currently produces a `.xex` executable, which can be booted up in common Atari e
## Todo
See ideas in `todo.md`.
See ideas in `todo.md`.


@ -11,40 +11,19 @@ function db(func) {
return lines.join('\n');
}
let squares = [];
for (let i = 0; i < 512; i++) {
squares.push(Math.trunc((i * i + 1) / 2));
}
console.log(
`.segment "TABLES"
.export mul_lobyte256
.export mul_hibyte256
.export mul_hibyte512
.export sqr_lobyte
.export sqr_hibyte
.export mul_lobyte
.export mul_hibyte
; (i * i + 1) / 2 for the multiplier
; (i * i) / 2 for the multiplier
.align 256
mul_lobyte256:
${db((i) => squares[i] & 0xff)}
mul_lobyte:
${db((i) => ((i * i) >> 1) & 0xff)}
.align 256
mul_hibyte256:
${db((i) => (squares[i] >> 8) & 0xff)}
.align 256
mul_hibyte512:
${db((i) => (squares[i + 256] >> 8) & 0xff)}
; (i * i) for the plain squares
.align 256
sqr_lobyte:
${db((i) => (i * i) & 0xff)}
.align 256
sqr_hibyte:
${db((i) => ((i * i) >> 8) & 0xff)}
mul_hibyte:
${db((i) => ((i * i) >> 9) & 0xff)}
`);

12
todo.md

@ -1,17 +1,15 @@
things to try:
* fix status bar to show elapsed time, per-iter time, per-pixel iter count
* 'turbo' mode disabling graphics in full or part
* patch the entire expanded-ram imul8xe on top of imul8 to avoid the 3-cycle thunk penalty :D
* maybe clean up the load/layout of the big mul table
* consider alternate lookup tables in the top 16KB under ROM
* try 3.13 fixed point instead of 4.12 for more precision
* can we get away without the extra bit?
* y-axis mirror optimization
* 'wide pixels' 2x and 4x for a fuller initial image in the tiered rendering
* maybe redo tiering to just 4x4, 2x2, 1x1?
* extract viewport for display & re-input via keyboard
* fujinet screenshot/viewport uploader