8 changed files with 927 additions and 246 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 *.o
 *.xex
+tables.s
 .DS_Store
--- a/8
+++ b/8
@@ -2,13 +2,17 @@

 all : mandel.xex

-%.xex : %.o
-	ld65 -C atari-asm-xex.cfg -o $@ $<
+mandel.xex : mandel.o tables.o atari-asm-xex.cfg
+	ld65 -C ./atari-asm-xex.cfg -o $@ mandel.o tables.o

 %.o : %.s
 	ca65 -o $@ $<

+tables.s : tables.js
+	node tables.js > tables.s
+
 clean :
+	rm -f tables.s
 	rm -f *.o
 	rm -f *.xex

--- a/atari-asm-xex.cfg
+++ b/atari-asm-xex.cfg
@@ -0,0 +1,28 @@
+FEATURES {
+    STARTADDRESS: default = $2E00;
+}
+SYMBOLS {
+    __STARTADDRESS__: type = export, value = %S;
+}
+MEMORY {
+    ZP:      file = "", define = yes, start = $0082, size = $007E;
+    MAIN:    file = %O, define = yes, start = %S,    size = $4000 - %S;
+    # Keep $4000-7fff clear for expanded RAM access window
+    TABLES:  file = %O, define = yes, start = $8000, size = $a000 - $8000;
+    # Keep $a000-$bfff clear for BASIC cartridge
+}
+FILES {
+    %O: format = atari;
+}
+FORMATS {
+    atari: runad = start;
+}
+SEGMENTS {
+    ZEROPAGE: load = ZP,      type = zp,  optional = yes;
+    EXTZP:    load = ZP,      type = zp,  optional = yes; # to enable modules to be able to link to C and assembler programs
+    CODE:     load = MAIN,    type = rw,                  define = yes;
+    RODATA:   load = MAIN,    type = ro   optional = yes;
+    DATA:     load = MAIN,    type = rw   optional = yes;
+    BSS:      load = MAIN,    type = bss, optional = yes, define = yes;
+    TABLES:   load = TABLES,  type = ro,  optional = yes, align = 256;
+}
--- a/mandel.s
+++ b/mandel.s
--- a/readme.md
+++ b/readme.md
@@ -14,32 +14,37 @@ Non-goals:

 Enjoy! I'll probably work on this off and on for the next few weeks until I've got it producing fractals.

-- brooke, january 2023 - february 2024
+-- brooke, january 2023 - december 2024

 ## Current state

-Basic rendering is functional, but no interactive behavior (zoom/pan) or benchmarking is done yet.
+Basic rendering is functional, with interactive zoom/pan (+/-/arrows) and 4 preset viewports via the number keys.

-The 16-bit signed integer multiplication works; it takes two 16-bit inputs and emits one 32-bit output in the zero page, using the Atari OS ROM's floating point registers as workspaces. Inputs are clobbered.
+The 16-bit signed integer multiplication takes two 16-bit inputs and emits one 32-bit output in the zero page, using the Atari OS ROM's floating point registers as workspaces. Inputs are clobbered.

-The main loop is a basic add-and-shift, using 16-bit adds which requires flipping the sign of negative inputs (otherwise you'd have to add all those sign-extension bits). Runs in 470-780 cycles depending on input.
+* 16-bit multiplies are decomposed into 4 8-bit unsigned multiplies and some addition
+* an optimized case for squares uses a table of 8-bit squares to reduce the number of 8-bit multiplication sub-ops
+* when expanded RAM is available as on 130XE, a 64KB 8-bit multiplication table accelerates the remaining multiplications
+* without expanded RAM, a table of half-squares is used to implement the algorithm from https://everything2.com/title/Fast+6502+multiplication

-The mandelbrot calculations are done using 4.12-precision fixed point numbers. It may be possible to squish this down to 3.13.
+The mandelbrot calculations are done using 4.12-precision fixed point numbers with 8.24-precision intermediates. It may be possible to squish this down to 3.13/6.26.

 Iterations are capped at 255.

 The pixels are run in a progressive layout to get the basic shape on screen faster.

-## Next steps
+There is a running counter of ms/px using the vertical blank interrupts as a timer, used to track our progress. :D

-Add a running counter of ms/px using the vertical blank interrupts as a timer. This'll show how further work improves it!
+There's a check for cycles in (zx,zy) output when in the 'lake'; if values repeat, they cannot escape. This is a big time saver in fractint.

-Check for cycles in (zx,zy) output when in the 'lake'; if values repeat, they cannot escape. This is a big time saver in fractint.
-
-I may be able to do a faster multiply using tables of squares for 8-bit component multiplication.
+There's some cute color cycling.

 ## Deps and build instructions

 I'm using `ca65` as a macro assembler, and have a Unix-style `Makefile` for building. Should work fairly easily on Linux and Mac. Might work on "raw" Windows but I use WSL for that.

 Currently produces a `.xex` executable, which can be booted up in common Atari emulators and some i/o devices.
+
+## Todo
+
+See ideas in `todo.md`.
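
The readme change above describes 4.12 fixed-point values with 8.24 intermediates. As a rough illustration of what that means, here is a minimal JavaScript sketch of one pixel's iteration loop. It is not taken from `mandel.s`: the helper names (`toFixed`, `mul`, `round16`, `iterate`), the rounding mode, and the exact escape comparison (`>= 4` done in 8.24 space) are all assumptions made for the example.

```javascript
// Hypothetical helpers for illustration only; none of these names come from mandel.s.
const FRAC_BITS = 12;                     // 4.12: 4 integer bits, 12 fractional bits

// Real number -> 4.12 value stored as a signed 16-bit integer.
function toFixed(x) {
    return Math.round(x * (1 << FRAC_BITS)) << 16 >> 16;
}

// 4.12 * 4.12 -> 8.24, held in a signed 32-bit integer.
function mul(a, b) {
    return Math.imul(a, b);
}

// 8.24 -> 4.12 with round-to-nearest (an assumed rounding mode), wrapped to 16 bits.
function round16(x) {
    return ((x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS) << 16 >> 16;
}

// Returns the iteration at which the point escapes, or 0 if it never does.
function iterate(cx, cy, maxIter = 255) {
    let zx = 0, zy = 0;
    for (let i = 1; i <= maxIter; i++) {
        const zx2 = mul(zx, zx);          // 8.24
        const zy2 = mul(zy, zy);          // 8.24
        // Escape test |z|^2 >= 4, compared directly on the 8.24 intermediates.
        if (zx2 + zy2 >= (4 << 24)) return i;
        const zxy = mul(zx, zy);          // 8.24
        zy = round16(2 * zxy) + cy;       // back to 4.12
        zx = round16(zx2 - zy2) + cx;     // back to 4.12
    }
    return 0;
}

console.log(iterate(toFixed(0), toFixed(0)));   // origin never escapes
console.log(iterate(toFixed(2), toFixed(2)));   // far outside the set
```

Doing the escape comparison on the wide intermediates avoids rounding `zx2` and `zy2` down to 16 bits just to compare them, which is presumably part of the appeal of keeping them at 8.24.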
--- a/tables.js
+++ b/tables.js
@@ -0,0 +1,50 @@
+function db(func) {
+    let lines = [];
+    for (let i = 0; i < 256; i += 16) {
+        let items = [];
+        for (let j = 0; j < 16; j++) {
+            let x = i + j;
+            items.push(func(x));
+        }
+        lines.push('    .byte ' + items.join(', '));
+    }
+    return lines.join('\n');
+}
+
+let squares = [];
+for (let i = 0; i < 512; i++) {
+    squares.push(Math.trunc((i * i + 1) / 2));
+}
+
+console.log(
+`.segment "TABLES"
+
+.export mul_lobyte256
+.export mul_hibyte256
+.export mul_hibyte512
+.export sqr_lobyte
+.export sqr_hibyte
+
+; (i * i + 1) / 2 for the multiplier
+.align 256
+mul_lobyte256:
+${db((i) => squares[i] & 0xff)}
+
+.align 256
+mul_hibyte256:
+${db((i) => (squares[i] >> 8) & 0xff)}
+
+.align 256
+mul_hibyte512:
+${db((i) => (squares[i + 256] >> 8) & 0xff)}
+
+; (i * i) for the plain squares
+.align 256
+sqr_lobyte:
+${db((i) => (i * i) & 0xff)}
+
+.align 256
+sqr_hibyte:
+${db((i) => ((i * i) >> 8) & 0xff)}
+
+`);
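
`tables.js` emits two high-byte tables for the 512-entry half-square function (`mul_hibyte256`, `mul_hibyte512`) but only one low-byte table. One plausible reason, inferred from the arithmetic rather than stated anywhere in the commit: entry `i + 256` equals entry `i` plus `256 * i + 32768`, a multiple of 256, so the low bytes of the upper half repeat those of the lower half. The short check below rebuilds the same `squares` array and counts mismatches.

```javascript
// Same half-square formula as tables.js: trunc((i*i + 1) / 2) for i in 0..511.
const squares = [];
for (let i = 0; i < 512; i++) {
    squares.push(Math.trunc((i * i + 1) / 2));
}

// Count indices where the low byte of entry i+256 differs from the low byte of entry i.
let mismatches = 0;
for (let i = 0; i < 256; i++) {
    if ((squares[i + 256] & 0xff) !== (squares[i] & 0xff)) {
        mismatches++;
    }
}

console.log(`low-byte mismatches: ${mismatches}`);
```

If that reading is right, a lookup for a sum `a + b` in 256..510 can index `mul_lobyte256` with the low 8 bits of the sum and only needs the separate table for the high byte.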
--- a/testme.js
+++ b/testme.js
@@ -0,0 +1,41 @@
+// ax = (a + x)^2/2 - a^2/2 - x^2/2
+
+function half_square(x) {
+    return Math.round(x * x / 2) & 0xffff >>> 0;
+}
+
+function mul8(a, b) {
+    let result = half_square(a + b) & 0xffff;
+    result = (result - half_square(a)) & 0xffff;
+    result = (result - half_square(b)) & 0xffff;
+    result = (result + (b & a & 1)) & 0xffff;
+    return result >>> 0;
+}
+
+function mul16(a, b) {
+    let ah = (a & 0xff00) >>> 8;
+    let al = (a & 0x00ff) >>> 0;
+    let bh = (b & 0xff00) >>> 8;
+    let bl = (b & 0x00ff) >>> 0;
+    let result = (mul8(al, bl) & 0xffff) >>> 0;
+    result = ((result + (mul8(ah, bl) << 8)) & 0x00ffffff) >>> 0;
+    result = ((result + (mul8(al, bh) << 8)) & 0x01ffffff) >>> 0;
+    result = ((result + (mul8(ah, bh) << 16)) & 0xffffffff) >>> 0;
+    return result;
+}
+
+let max = 65536;
+//let max = 256;
+//let max = 128;
+//let max = 8;
+
+for (let a = 0; a < max; a++) {
+    for (let b = 0; b < max; b++) {
+        let expected = Math.imul(a, b) >>> 0;
+        //let actual = mul8(a, b);
+        let actual = mul16(a, b);
+        if (expected !== actual) {
+            console.log(`wrong! ${a} * ${b} expected ${expected} got ${actual}`);
+        }
+    }
+}
--- a/todo.md
+++ b/todo.md
@@ -0,0 +1,19 @@
+things to try:
+
+* skip add on the top-byte multiply in sqr8/mul8
+  * should save a few cycles, suggestion by jamey
+
+* patch the entire expanded-ram imul8xe on top of imul8 to avoid the 3-cycle thunk penalty :D
+
+* try 3.13 fixed point instead of 4.12 for more precision
+  * can we get away without the extra bit?
+  * since exit compare space would be 6.26 i think so
+
+* y-axis mirror optimization
+
+* 'wide pixels' 2x and 4x for a fuller initial image in the tiered rendering
+  * maybe redo tiering to just 4x4, 2x2, 1x1?
+
+* extract viewport for display & re-input via keyboard
+
+* fujinet screenshot/viewport uploader
Author	SHA1	Message	Date
Brooke Vibber	d8601bb856	fix fix	2024-12-31 15:03:43 -08:00
Brooke Vibber	7985ea9a39	fix panning for 32-bi	2024-12-31 14:45:38 -08:00
Brooke Vibber	cc83c76706	update docs for 32-bit intermediates	2024-12-31 14:16:43 -08:00
Brooke Vibber	2e8893fd78	haha fuck me	2024-12-31 13:54:53 -08:00
Brooke Vibber	81bf7f3c43	tweak	2024-12-31 09:53:22 -08:00
Brooke Vibber	1e0f577e09	wip	2024-12-31 09:09:11 -08:00
Brooke Vibber	d2f41f9644	wip	2024-12-31 09:02:42 -08:00
Brooke Vibber	2fcb30b76a	wip	2024-12-31 08:56:59 -08:00
Brooke Vibber	13257309dc	init fix	2024-12-31 08:34:02 -08:00
Brooke Vibber	7184b8e03f	wip	2024-12-31 08:24:47 -08:00
Brooke Vibber	4a1e35699a	wip	2024-12-31 08:24:44 -08:00
Brooke Vibber	0d086a179c	wip	2024-12-31 08:23:04 -08:00
Brooke Vibber	61eb1aaf21	notes	2024-12-31 05:11:26 -08:00
Brooke Vibber	b56dc1e98b	notes	2024-12-30 20:38:33 -08:00
Brooke Vibber	0a7293d8bc	do 4x4 2x2 1x1 only in prep for bigger pixels	2024-12-30 19:52:35 -08:00
Brooke Vibber	ec42f672d4	use an 8-item z buffer for slightly fasterness	2024-12-30 19:48:28 -08:00
Brooke Vibber	67649d4743	annotations, tweak	2024-12-30 19:17:02 -08:00
Brooke Vibber	ed79c80b16	update readme	2024-12-30 16:50:25 -08:00
Brooke Vibber	e6cbe0bc6b	notes	2024-12-30 16:43:18 -08:00
Brooke Vibber	6db8cef82d	51-70 cycles for xe :D	2024-12-30 15:17:50 -08:00
Brooke Vibber	9b7f6b8937	add a viewport in the front spike	2024-12-30 14:22:03 -08:00
Brooke Vibber	3bd9b1ac31	micro-optimizations in imul8xe 53-72 cycles overview in 10.896 ms/px	2024-12-30 14:09:02 -08:00
Brooke Vibber	63e74d5152	tweak	2024-12-30 13:44:31 -08:00
Brooke Vibber	14125a398a	cycle 'in' not 'out'	2024-12-30 11:35:45 -08:00
Brooke Vibber	71d8d93abc	even better palette cycling	2024-12-30 11:33:55 -08:00
Brooke Vibber	64a6cf50f3	awesome new palette cycler	2024-12-30 10:21:52 -08:00
Brooke Vibber	100c0f3314	1/2/3 selectable viewports	2024-12-30 09:19:41 -08:00
Brooke Vibber	e51aa91e4e	notes	2024-12-30 06:48:04 -08:00
Brooke Vibber	c4b98c7be2	optimize out a temporary down to 11.076 ms/px on xe	2024-12-30 05:35:22 -08:00
Brooke Vibber	70d2c91f03	fix bank switch on xl/xe was accidentally enabling basic rom :D 5m46s - 11.759 ms/px - 800xl 5m30s - 11.215 ms/px - 130xe	2024-12-30 03:56:35 -08:00
Brooke Vibber	acac5a8df4	moving the framebuffer into the basic space fails on 130xe and 800xl for some reason works on 800 as expected	2024-12-29 21:19:55 -08:00
Brooke Vibber	883f926e57	split memory, wip appears to work on 800 but xl/xe overlap basic lol	2024-12-29 21:06:48 -08:00
Brooke Vibber	0c63430dd9	wip tables segment to be	2024-12-29 20:37:58 -08:00
Brooke Vibber	3ab5006aa3	wip refacotring	2024-12-29 17:56:14 -08:00
Brooke Vibber	f903272335	refactoring and start on squares	2024-12-29 17:37:06 -08:00
Brooke Vibber	8ad996981a	whoops	2024-12-29 13:19:58 -08:00
Brooke Vibber	15fc5367f9	switck with the overview as default fo rnow	2024-12-29 13:18:54 -08:00
Brooke Vibber	2118890977	add an alternate viewport (compile-time currently) zoomed to max	2024-12-29 13:10:35 -08:00
Brooke Vibber	0fc5ba914f	fix pan/zoom bug was missing an rts on update_palette this happened to fall through to keycheck which if timing was wrong would dutifully process the viewport change and return to update_palette's caller which in turn was -not- expecting to reset the outer loop fixed	2024-12-29 12:29:36 -08:00
Brooke Vibber	2b0167226e	todos	2024-12-28 20:44:27 -08:00
Brooke Vibber	504457595a	correct zoom border checks	2024-12-28 18:11:35 -08:00
Brooke Vibber	0fcf4d6676	comment tweak	2024-12-28 17:40:21 -08:00
Brooke Vibber	d83b811444	remove stray copy of the expanded-ram imul it's not finished or working, just keep the core one :D	2024-12-28 15:13:06 -08:00
Brooke Vibber	f32cc5fa7c	whoops	2024-12-27 19:15:19 -08:00
Brooke Vibber	052a19b6aa	Merge pull request 'xe' (#1 ) from xe into main Reviewed-on: https://brooke.vibber.net/git/git/brooke/mandel-6502/pulls/1	2024-12-28 02:40:01 +00:00
Brooke Vibber	83cba4afa3	Runtime detection of XE-style extended memory Uses the "big multiplication table" in 64KB of extended memory if bank switching appears to work, otherwise uses the table of squares lookups. Initial view clocks in at 13.133 ms/px for the XE version and still 14.211 ms/px for the 400/800/XL version. Tested in emulator with 130XE and XL+Ultimate 1MB upgrade configs, and base implementation on the 800XL emulator.	2024-12-27 18:37:03 -08:00
Brooke Vibber	ee1c268705	it works	2024-12-26 21:49:13 -08:00
Brooke Vibber	e84a990789	tweaks:	2024-12-26 21:41:03 -08:00
Brooke Vibber	0cde31905e	runs but doesn't work	2024-12-26 18:35:37 -08:00
Brooke Vibber	45c5a4cb2d	called, gets lost	2024-12-26 18:20:10 -08:00
Brooke Vibber	34ce9da030	builds, not used yte	2024-12-26 18:17:01 -08:00
Brooke Vibber	a9d551a98d	first draft initializer	2024-12-26 17:50:59 -08:00
Brooke Vibber	829d2860e8	:P	2024-12-26 12:04:01 -08:00
Brooke Vibber	f996c3cbcd	provisional maybe old mode runs in 81-92 cycles provisional code runs in 58-77 cycles if it works ;)	2024-12-25 12:47:37 -08:00
Brooke Vibber	405cec6d51	WIP imul8 via table experiments planning to try a 64KB table of 8x7-bit multiplies in the high memory on a 130XE or other high-memory-capable machine not yet working or finished too many cycles of overhead per invocation	2024-12-25 10:51:27 -08:00
Brooke Vibber	05133aabdd	slightly faster handling of signed mul previously we were flipping the inputs if negative, and then the output if both inputs were negative turns out you can just treat the whole thing as an unsigned mul and then subtract each term from the high word if the other term is negative. https://stackoverflow.com/a/28827013 this saves a handful of cycles, reducing our runtime to a merge 14.211 ms/px \o/	2024-12-15 20:17:45 -08:00
Brooke Vibber	7f2bc43cff	squares	2024-12-14 18:56:26 -08:00
Brion Vibber	5637783529	Faster imul16 routine Improves runtime from 16.24 ms/px to 14.44 ms/px This uses a routine found on Everything2: https://everything2.com/title/Fast+6502+multiplication which uses a lookup table of squares to do 8-bit imuls, which are then composed into a 16-bit imul	2024-12-14 18:53:31 -08:00
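
Commit 05133aabdd describes handling signed multiplies by doing an unsigned multiply and then subtracting each term from the high word when the other term is negative. The JavaScript sketch below illustrates that correction for 16-bit inputs; the function name `imul16signed` and its structure are hypothetical and only mirror the commit message, not the actual 6502 routine.

```javascript
// Hypothetical helper: signed 16x16 -> 32-bit multiply built on an unsigned product.
// a and b are expected to be integers in the range -32768..32767.
function imul16signed(a, b) {
    const ua = a & 0xffff;            // reinterpret inputs as unsigned 16-bit
    const ub = b & 0xffff;
    const product = ua * ub;          // unsigned product, below 2^32, exact in a double

    let hi = product >>> 16;          // high word of the unsigned product
    const lo = product & 0xffff;      // low word is already correct

    // Correction step: for each negative input, subtract the *other* input
    // from the high word (mod 2^16).
    if (a < 0) hi = (hi - ub) & 0xffff;
    if (b < 0) hi = (hi - ua) & 0xffff;

    return ((hi << 16) | lo) | 0;     // reassemble as a signed 32-bit value
}

console.log(imul16signed(-2, 3));
console.log(imul16signed(-1, -1));
```

The correction works because an unsigned reading of a negative 16-bit value is off by exactly 2^16, so the extra term lands entirely in the high word.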