This costs an extra half cycle on average, assuming uniform distribution
of multiplication inputs. I don't think a half cycle is worth an extra
256-byte table.
Improves runtime from 16.24 ms/px to 14.44 ms/px (about 11% less time per pixel)
This uses a routine found on Everything2:
https://everything2.com/title/Fast+6502+multiplication
The routine uses a lookup table of squares to do 8-bit imuls,
which are then composed into a 16-bit imul.
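
For reference, a minimal sketch of the idea in Python rather than 6502
assembly. It assumes the routine relies on the well-known quarter-square
identity a*b = floor((a+b)^2/4) - floor((a-b)^2/4), and it handles unsigned
operands only. The names QUARTER_SQUARES, mul8 and mul16 are made up for
illustration, and the single-table layout is an assumption; the actual
routine may organize its tables differently.

```python
# Table of floor(n*n / 4) for n = 0..510, which covers a + b for any
# 8-bit a and b. Assumed layout: a real 6502 routine would likely split
# this into separate low-byte and high-byte tables.
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]


def mul8(a, b):
    """Unsigned 8-bit x 8-bit -> 16-bit product using two table lookups."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    # a+b and a-b have the same parity, so the rounding errors cancel
    # and the difference is exactly a*b.
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]


def mul16(x, y):
    """Unsigned 16-bit x 16-bit -> 32-bit product from four mul8 calls."""
    assert 0 <= x <= 0xFFFF and 0 <= y <= 0xFFFF
    x_lo, x_hi = x & 0xFF, x >> 8
    y_lo, y_hi = y & 0xFF, y >> 8
    # Schoolbook composition of the four partial products.
    return (mul8(x_lo, y_lo)
            + ((mul8(x_lo, y_hi) + mul8(x_hi, y_lo)) << 8)
            + (mul8(x_hi, y_hi) << 16))
```

The sketch uses no multiply instruction at runtime, only additions,
subtractions, shifts and table lookups, which is what makes the approach
attractive on a CPU without hardware multiplication.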