Note how many cycles loop unrolling saves

+4 cycles one-time setup
    ldx #8      ; 2 cyc for first 8
    ldx #8      ; 2 cyc for second 8 (different shift behavior)

-5 cycles/iter to get bit now

    lsr arg + 1 ; 5 cyc
    rol arg     ; 5 cyc
+10 cycles/iter to get the bit in a loop

    dex         ; 2 cyc
    bne         ; 2 cyc
4 cycles/iter for the loop

4 + (14 * 16) = 4 + 224 = 228
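The tally above can be replayed in a few lines of Python. This is only a sketch. The constant and function names are hypothetical, and the per-instruction counts are the ones listed in the commit message (5 cycles for `lsr`/`rol` assumes zero-page operands). The commit's total counts 14 cycles per bit (the shift pair plus `dex`/`bne`) over 16 bits, plus 4 cycles of setup. The second function shows what the figure would be if the 5-cycle bit fetch listed for the current code were credited back.

```python
# Cycle tally for a looped (not unrolled) version of the 16-bit multiply,
# using the per-instruction cycle counts listed in the commit message.
# All names here are hypothetical, chosen only for this sketch.

SETUP = 2 + 2      # two `ldx #8` loads, one per 8-bit half of the multiplier
SHIFT = 5 + 5      # `lsr arg + 1` + `rol arg` (zero-page operands assumed)
LOOP = 2 + 2       # `dex` + `bne`, counting bne at 2 cycles as the commit does
BIT_FETCH = 5      # per-bit cost the commit lists for getting the bit "now"
BITS = 16          # one iteration per multiplier bit


def commit_total() -> int:
    """The commit's formula: 4 + (14 * 16)."""
    return SETUP + (SHIFT + LOOP) * BITS


def netted_total() -> int:
    """Same tally with the 5-cycle per-bit fetch credited back."""
    return SETUP + (SHIFT + LOOP - BIT_FETCH) * BITS


if __name__ == "__main__":
    print(commit_total())  # 228
    print(netted_total())  # 148
```

On a real 6502 a taken `bne` costs 3 cycles (4 if it crosses a page), so counting it at 2 cycles slightly understates what the looped version would cost.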
2023-01-04 21:02:18 -08:00
commit 1408855eed
parent 519f8ad635
1 changed file with 2 additions and 0 deletions
--- a/readme.md
+++ b/readme.md
@@ -21,6 +21,8 @@ The 16-bit signed integer multiplication seems to be working, though I need to d
 The main loop is a basic add-and-shift, using 16-bit adds, which requires flipping the sign of negative inputs (otherwise you'd have to add all those sign-extension bits). It runs in 470-780 cycles depending on input.
 The loop is unrolled, which saves 228 cycles at the cost of making the routine quite large. This is an acceptable tradeoff for the Mandelbrot, where imul16 is the dominant performance cost and the rest of the program will be small.
 The Mandelbrot loop is partly sketched out, but I still have further updates to make there.
 I've also sketched out a 16-bit rounding macro, which is not yet committed.
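The readme's approach can be modelled in Python: flip negative inputs, run an unsigned add-and-shift, then restore the sign. This is an illustrative sketch, not the project's 6502 `imul16`. The names `umul16` and `imul16_sketch` are hypothetical, and the sketch assumes a 32-bit product, which the readme does not state.

```python
# Sketch of a signed 16x16 -> 32-bit add-and-shift multiply that works on
# magnitudes: flip negative inputs, multiply unsigned, then fix the sign.
# Function names are hypothetical, not the project's imul16 routine.


def umul16(a: int, b: int) -> int:
    """Unsigned 16x16 -> 32-bit multiply by add-and-shift."""
    if not (0 <= a <= 0xFFFF and 0 <= b <= 0xFFFF):
        raise ValueError("inputs must be unsigned 16-bit values")
    product = 0
    for bit in range(16):
        # Test one multiplier bit; if set, add the 16-bit multiplicand
        # shifted to that bit position. Because b is non-negative, there
        # are no sign-extension bits to add into the upper half.
        if (a >> bit) & 1:
            product += b << bit
    return product


def imul16_sketch(a: int, b: int) -> int:
    """Signed 16x16 -> 32-bit multiply via sign flipping."""
    if not (-0x8000 <= a <= 0x7FFF and -0x8000 <= b <= 0x7FFF):
        raise ValueError("inputs must be signed 16-bit values")
    negative = (a < 0) != (b < 0)      # result sign is the XOR of input signs
    product = umul16(abs(a), abs(b))   # abs(-32768) == 32768 still fits in 16 bits
    return -product if negative else product


print(imul16_sketch(-300, 250))  # -75000
```

Working on magnitudes keeps every partial product a plain shifted copy of the multiplicand. With a negative two's-complement multiplicand, each partial product would instead need its sign extended across the upper bits of the 32-bit result.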