Efficient and Concise Multiplication in CHIP-8

I've long entertained how to nicely express a good many algorithms in CHIP-8, but hadn't thought such
thoughts warranted an article.  I recently read of someone doing something similar, and this changed
my mind.  That author wrote a program to unintelligently and inefficiently find instruction listings.

My approach targets multiplication and uses simple logic to get, I think, a nicer result.  In mulling
possible instructions, it's clear which are applicable, and so whether the machine can perform the
task in one instruction; if it can't and, say, two are found to suffice, they form both an answer and
an upper bound.

Binary shifts give powers of two a clear advantage, and an important aspect of intelligently hacking
machine code is to consider global optimizations.  This simple routine is optimal for all such shifts:
calling a label runs only the shifts from there down to the return.  It could be optimized further
through more indirection to handle register management.  The first shift is unnecessary, as eight
shifts always clear an eight-bit register.  These are placed at no particular address:

400-401 1024-1025 ▀   ▄▄▄  800E 32782            ×256 V0 ← V0 × 2; VF ← MSB
402-403 1026-1027 ▀   ▄▄▄  800E 32782            ×128 V0 ← V0 × 2; VF ← MSB
404-405 1028-1029 ▀   ▄▄▄  800E 32782             ×64 V0 ← V0 × 2; VF ← MSB
406-407 1030-1031 ▀   ▄▄▄  800E 32782             ×32 V0 ← V0 × 2; VF ← MSB
408-409 1032-1033 ▀   ▄▄▄  800E 32782             ×16 V0 ← V0 × 2; VF ← MSB
40A-40B 1034-1035 ▀   ▄▄▄  800E 32782              ×8 V0 ← V0 × 2; VF ← MSB
40C-40D 1036-1037 ▀   ▄▄▄  800E 32782              ×4 V0 ← V0 × 2; VF ← MSB
40E-40F 1038-1039 ▀   ▄▄▄  800E 32782              ×2 V0 ← V0 × 2; VF ← MSB
410-411 1040-1041 ▄▄▄ ▄▄▄  00EE 00238                 Return
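
To make the fall-through explicit, here is a minimal Python sketch of the routine above.  It is only a
model, not CHIP-8 code; the names ROUTINE_END, entry_address, call_opcode, and run_from are my own
assumptions for illustration.  It derives which address a call must target for a given power of two:

```python
# Sketch only: models the fall-through shift routine listed above.
# ROUTINE_END and the function names are illustrative, not from the article.
ROUTINE_END = 0x410          # address of the Return (00EE) in the listing


def entry_address(power_of_two: int) -> int:
    """Address to call so that V0 is multiplied by power_of_two."""
    if not 2 <= power_of_two <= 256 or power_of_two & (power_of_two - 1):
        raise ValueError("expected a power of two from 2 to 256")
    shifts = power_of_two.bit_length() - 1      # log2 for exact powers of two
    return ROUTINE_END - 2 * shifts             # each 800E occupies two bytes


def call_opcode(power_of_two: int) -> int:
    """The 2NNN instruction that calls the chosen entry point."""
    return 0x2000 | entry_address(power_of_two)


def run_from(address: int, v0: int) -> int:
    """Execute the shifts from address down to the Return on an 8-bit V0."""
    while address < ROUTINE_END:
        v0 = (v0 << 1) & 0xFF                   # 800E: V0 <- V0 * 2
        address += 2
    return v0
```

Under these assumptions, call_opcode(4) gives 240C and call_opcode(8) gives 240A, matching the calls
used in the listings that follow, and entering at 400 clears any V0.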

Adding a register to itself and a lone shift instruction are both suitable for doubling, but the
latter has the significant advantage of selecting a destination register rather than clobbering the
source.

A simple multiplication by three uses a second register, forming this trivial pair:

200-201 0512-0513 ▀   ▄▄▄▀ 810E 33038                 V1 ← V0 × 2; VF ← MSB
202-203 0514-0515 ▀    ▄ ▀ 8104 33028                 V1 ← V1 + V0; VF ← overflow

A multiplication by five is an opportunity to use part of the routine and is a mere instruction larger:

200-201 0512-0513 ▀      ▀ 8100 33024                 V1 ← V0
202-203 0514-0515   ▀ ▄█   240C 09228                 Call ×4
204-205 0516-0517 ▀    ▄ ▀ 8104 33028                 V1 ← V1 + V0; VF ← overflow

Multiplying by seven is close enough to a power of two for this alternative approach to be pleasant:

200-201 0512-0513 ▀      ▀ 8100 33024                 V1 ← V0
202-203 0514-0515   ▀ ▄▀▄  240A 09226                 Call ×8
204-205 0516-0517 ▀  ▄ ▄ ▄ 8015 32789                 V0 ← V0 − V1; VF ← borrow
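
The three short listings so far can be checked mechanically.  What follows is a hedged sketch of a
tiny interpreter in Python covering only the opcodes used up to this point; it assumes the original
8XYE behaviour (VX receives VY shifted left), as the listings do, and the names SHIFTS, load, and run
are mine, not part of any real emulator:

```python
# Sketch only: a tiny interpreter for the handful of opcodes used so far.
# SHIFTS, load() and run() are illustrative names, not from the article.

SHIFTS = [0x800E] * 8 + [0x00EE]       # the routine at 400-411


def load(memory: bytearray, address: int, words: list[int]) -> None:
    """Store big-endian 16-bit instructions starting at address."""
    for word in words:
        memory[address] = word >> 8
        memory[address + 1] = word & 0xFF
        address += 2


def run(program: list[int], v0: int) -> list[int]:
    """Run program at 200 until it falls off its end; return the registers."""
    memory = bytearray(4096)
    load(memory, 0x400, SHIFTS)
    load(memory, 0x200, program)
    end = 0x200 + 2 * len(program)
    v = [0] * 16
    v[0] = v0
    stack, pc = [], 0x200
    while pc != end:
        op = memory[pc] << 8 | memory[pc + 1]
        pc += 2
        x, y, n = op >> 8 & 0xF, op >> 4 & 0xF, op & 0xF
        if op == 0x00EE:                       # return
            pc = stack.pop()
        elif op >> 12 == 0x2:                  # call NNN
            stack.append(pc)
            pc = op & 0xFFF
        elif op >> 12 == 0x8 and n == 0x0:     # VX <- VY
            v[x] = v[y]
        elif op >> 12 == 0x8 and n == 0x4:     # VX <- VX + VY, VF <- carry
            total = v[x] + v[y]
            v[x], v[0xF] = total & 0xFF, total >> 8
        elif op >> 12 == 0x8 and n == 0x5:     # VX <- VX - VY, VF <- no borrow
            flag = int(v[x] >= v[y])
            v[x], v[0xF] = (v[x] - v[y]) & 0xFF, flag
        elif op >> 12 == 0x8 and n == 0xE:     # VX <- VY * 2, VF <- MSB
            msb = v[y] >> 7
            v[x], v[0xF] = (v[y] << 1) & 0xFF, msb
        else:
            raise ValueError(f"unsupported opcode {op:04X}")
    return v


times_three = [0x810E, 0x8104]
times_five = [0x8100, 0x240C, 0x8104]
times_seven = [0x8100, 0x240A, 0x8015]
```

Note where each result lands: run(times_three, 5)[1] and run(times_five, 5)[1] read V1, whereas
run(times_seven, 5)[0] reads V0.  All arithmetic wraps at eight bits, as on the machine.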

Multiplying well by larger numbers uses fragmentation; this takes twenty-four as eight plus sixteen,
with the multiplicand now kept in V1:

200-201 0512-0513 ▀  ▄     8010 32784                 V0 ← V1
202-203 0514-0515   ▀ ▄▀▄  240A 09226                 Call ×8
204-205 0516-0517 ▀     ▀  8200 33280                 V2 ← V0
206-207 0518-0519 ▀  ▄     8010 32784                 V0 ← V1
208-209 0520-0521   ▀ ▄▀   2408 09224                 Call ×16
20A-20B 0522-0523 ▀ ▄  ▄   8024 32804                 V0 ← V0 + V2; VF ← overflow

That's no comparison to the much nicer method of splitting twenty-four as eight times three, though:

200-201 0512-0513 ▀  ▄▄▄▄  801E 32798                 V0 ← V1 × 2; VF ← MSB
202-203 0514-0515 ▀  ▄ ▄   8014 32788                 V0 ← V0 + V1; VF ← overflow
204-205 0516-0517   ▀ ▄▀▄  240A 09226                 Call ×8
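
To put rough numbers on that comparison, here is a small Python sketch of both splits.  The cost
model, one unit per executed instruction including the shifts and return reached through each call,
is my own assumption, as are all the names:

```python
# Sketch only: checks the two splits of twenty-four against each other.
# The cost model (one unit per executed instruction) is an assumption.

def call_cost(power_of_two: int) -> int:
    """Instructions executed inside the shift routine: the shifts and the return."""
    shifts = power_of_two.bit_length() - 1      # 8 -> 3, 16 -> 4
    return shifts + 1


def eight_plus_sixteen(v1: int) -> int:
    v2 = (v1 * 8) & 0xFF                        # V0 <- V1; Call x8; V2 <- V0
    v0 = (v1 * 16) & 0xFF                       # V0 <- V1; Call x16
    return (v0 + v2) & 0xFF                     # V0 <- V0 + V2


def eight_times_three(v1: int) -> int:
    v0 = (v1 * 2) & 0xFF                        # V0 <- V1 * 2
    v0 = (v0 + v1) & 0xFF                       # V0 <- V0 + V1
    return (v0 * 8) & 0xFF                      # Call x8


SUM_COST = 6 + call_cost(8) + call_cost(16)     # six inline instructions
PRODUCT_COST = 3 + call_cost(8)                 # three inline instructions
```

Both functions agree for every eight-bit input, yet under this model the sum executes fifteen
instructions against seven for the product, and occupies twelve bytes against six.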

The number forty-three, built by adding or subtracting powers of two, should suit as a bad case:

200-201 0512-0513 ▀  ▄     8010 32784                 V0 ← V1
202-203 0514-0515   ▀  █▄  2406 09222                 Call ×32
204-205 0516-0517 ▀     ▀  8200 33280                 V2 ← V0
206-207 0518-0519 ▀  ▄     8010 32784                 V0 ← V1
208-209 0520-0521   ▀ ▄▀▄  240A 09226                 Call ×8
20A-20B 0522-0523 ▀ ▄  ▄   8024 32804                 V0 ← V0 + V2; VF ← overflow
20C-20D 0524-0525 ▀  ▄ ▄   8014 32788                 V0 ← V0 + V1; VF ← overflow
20E-20F 0526-0527 ▀  ▄ ▄   8014 32788                 V0 ← V0 + V1; VF ← overflow
210-211 0528-0529 ▀  ▄ ▄   8014 32788                 V0 ← V0 + V1; VF ← overflow

Notice the cost of three additions is just small enough that other approaches don't pay.  Replacing
the three additions with one and a shift results in the same count, and so isn't shown; multiplying
by four and then subtracting is one instruction larger, due to the extra register management:

200-201 0512-0513 ▀  ▄     8010 32784                 V0 ← V1
202-203 0514-0515   ▀  █▄  2406 09222                 Call ×32
204-205 0516-0517 ▀     ▀  8200 33280                 V2 ← V0
206-207 0518-0519 ▀  ▄     8010 32784                 V0 ← V1
208-209 0520-0521   ▀ ▄▀▄  240A 09226                 Call ×8
20A-20B 0522-0523 ▀    ▄▀  8204 33284                 V2 ← V2 + V0; VF ← overflow
20C-20D 0524-0525 ▀  ▄     8010 32784                 V0 ← V1
20E-20F 0526-0527   ▀ ▄█   240C 09228                 Call ×4
210-211 0528-0529 ▀ ▄  ▄   8024 32804                 V0 ← V0 + V2; VF ← overflow
212-213 0530-0531 ▀  ▄ ▄ ▄ 8015 32789                 V0 ← V0 − V1; VF ← borrow
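
As a sanity check on both forty-three listings, this Python sketch traces V0, V1, and V2 through each
in turn.  The function names are illustrative assumptions of mine, and each call to the shift routine
is collapsed into a single multiplication that wraps at eight bits:

```python
# Sketch only: traces V0, V1 and V2 through both forty-three listings.
# Function names are illustrative; arithmetic wraps at eight bits.

def byte(value: int) -> int:
    return value & 0xFF


def forty_three_by_additions(v1: int) -> int:
    v2 = byte(v1 * 32)              # V0 <- V1; Call ×32; V2 <- V0
    v0 = byte(v1 * 8)               # V0 <- V1; Call ×8
    v0 = byte(v0 + v2)              # forty times V1
    for _ in range(3):              # three additions of V1
        v0 = byte(v0 + v1)
    return v0


def forty_three_by_subtraction(v1: int) -> int:
    v2 = byte(v1 * 32)              # V0 <- V1; Call ×32; V2 <- V0
    v2 = byte(v2 + byte(v1 * 8))    # V0 <- V1; Call ×8; V2 <- V2 + V0
    v0 = byte(v1 * 4)               # V0 <- V1; Call ×4
    v0 = byte(v0 + v2)              # forty-four times V1
    return byte(v0 - v1)            # V0 <- V0 - V1
```

Both return forty-three times V1 modulo 256 for every input; the first corresponds to nine inline
instructions and the second to ten.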

Adding forty-three in a loop would be concise, but inefficient; a table is a good solution, yet changes I:

200-201 0512-0513 ▀ ▀ ▄ ▀  A208 41480                 I ← forty three
202-203 0514-0515 ▀▀▀█▄▄▄  F01E 61470                 I ← I + V0
204-205 0516-0517 ▀██▀ ▄ ▄ F065 61541                 Load V0→V0; I ← I + 01
206-207 0518-0519 ▄▄▄ ▄▄▄  00EE 00238                 Return
208     0520                 00   000     forty three
209     0521        █ █ ██   2B   043
20A     0522       █ █ ██    56   086
20B     0523      █      █   81   129
20C     0524      █ █ ██     AC   172
20D     0525      ██ █ ███   D7   215
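
The table bytes need not be worked out by hand.  A minimal Python sketch follows, with
multiples_table being a name of my own choosing; it keeps only the multiples that fit in one byte:

```python
# Sketch only: builds the lookup table bytes shown above.
# multiples_table() is an illustrative name, not from the article.

def multiples_table(factor: int) -> bytes:
    """Every multiple of factor that still fits in one byte, from zero up."""
    return bytes(range(0, 256, factor))


table = multiples_table(43)
print(table.hex(" ").upper())    # 00 2B 56 81 AC D7
```

The table has six entries because six times forty-three is 258, which no longer fits in a byte; the
lookup is thus only meaningful for V0 from zero to five.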

Know the lower registers are rather poor choices for practical routines here, and zero in particular
due to its use by BXXX.  I will provide no general rule for determining efficient multiplication, as
this can easily be done by a human at a whim.  An easy way to get good multiplication is to choose a
good number to multiply by.  Making a routine unnecessary is much better than making it more
efficient.