Title: Altair Assembler Part 3
Date: November 17 2020
Tags: altair programming
========================================

I had intended to write a series of posts documenting the process of writing my
assembler but ended up spending my available time writing the assembler.

The assembler being the largest program I've undertaken on the Altair, I
expected to take two or three years to complete it.  But because of the
pandemic, I've been home a lot more than usual for summertime and that put me in
the mood to write some assembly.

After about a year, and almost 3 Kilobytes of code, I've gotten the assembler
written and working.  I still need to do more exhaustive testing but the
problems are solved, sub-routines are written, and it will assemble a small
program start to finish.  I'll use this post to summarize the project.

To remind everyone, the point of this project was to write the minimum features
necessary to eliminate human error and allow me to write a real, full-featured
assembler.  The primary needs were translating assembler mnemonics to opcodes,
keeping count of the addresses, and allowing the use of labels.  As I went, I
ended up filling in many more features.  I figured that I could write
the algorithms now or in the full assembler later and it's "just a little more
code" so why not now?  And if something took too long to get right, I could just
comment it out and move on.

I also wanted this project to be a programming challenge so I tried to solve all
of the problems on my own.  That is, parsing a line of assembly code, storing
structured data in memory, searching, error checking, etc.  Other than a 20 year
old Computer Science education, the only "cheating" was taking a quick glance at
Hjalfi's CP/M assembler[0] written in C and seeing how he used function
callbacks for each opcode.  I had thought I would need to do some complex
decoding and branching based on the opcode's bits.  I probably could have done
some of that to reduce the amount of code, but by the time I got this far I was
just trying to get stuff written.  There was a lot more copying and pasting with
minor tweaking than I usually like to do.  That's why the assembler is so big
despite a lack of larger features found in other assemblers that aren't much
larger in size.  The other trick I picked up was to grow the symbol table down
like a stack.  I got that idea while watching a video about C64 assembly on
8-bit Show and Tell's YouTube channel[1] but I can't recall exactly which one.
That idea made sense as a way to maximize available memory for the user's
program without trying to guess a limit to the size of the symbol table.  I also
plan to move the assembler into a PROM chip in high memory, where there won't be
room for the symbol table and it wouldn't be writable anyway.
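
To illustrate the downward-growing symbol table, here is a minimal sketch in
Python rather than 8080 assembly.  It is not the assembler's actual code.  The
8-byte entry layout (a 6-character name plus a 2-byte address) and the names
SymbolTable, add, and find are assumptions made up for the example.

```python
# Hypothetical sketch: a symbol table that grows down from the top of RAM.
# ENTRY_SIZE and NAME_LEN are assumed values, not the assembler's real layout.
ENTRY_SIZE = 8
NAME_LEN = 6


class SymbolTable:
    def __init__(self, memory, top):
        self.memory = memory   # bytearray standing in for RAM
        self.top = top         # one past the highest usable address
        self.next = top        # lowest address in use; moves down on each add

    def _key(self, name):
        # Pad or cut the name to a fixed width so entries are a fixed size.
        return name.upper().ljust(NAME_LEN)[:NAME_LEN].encode("ascii")

    def add(self, name, address):
        self.next -= ENTRY_SIZE   # "push": claim the slot just below
        base = self.next
        self.memory[base:base + NAME_LEN] = self._key(name)
        self.memory[base + NAME_LEN] = address & 0xFF             # low byte
        self.memory[base + NAME_LEN + 1] = (address >> 8) & 0xFF  # high byte

    def find(self, name):
        wanted = self._key(name)
        # Walk from the newest entry up toward the top of memory.
        for base in range(self.next, self.top, ENTRY_SIZE):
            if bytes(self.memory[base:base + NAME_LEN]) == wanted:
                low = self.memory[base + NAME_LEN]
                high = self.memory[base + NAME_LEN + 1]
                return (high << 8) | low
        return None
```

The user's program grows up from low memory while the table grows down from
the top, so neither needs a fixed size limit until they actually meet.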


# Assembler Features #

Refer back to Part 1[2] of this series to see what I planned to do and not do in
this first pass.  The end result is a bit different.

Obviously, the assembler handles translating mnemonics to opcodes, covering the
entire 8080 instruction set.  I had to handle opcodes that encode a register
parameter in the opcode byte itself, opcodes that take data, opcodes that take
addresses, and so on.  That's where the callbacks came in.
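
As a rough sketch of the callback idea, in Python rather than 8080 assembly:
each mnemonic maps to a base opcode plus a handler that knows how to encode its
arguments.  The handler names and the OPCODES table are invented for the
example, and numeric arguments are assumed to be bare hex digits to keep it
short.  The opcode values and register codes are the standard 8080 encodings.

```python
# Standard 8080 register codes (M means "memory at HL").
REGS = {"B": 0, "C": 1, "D": 2, "E": 3, "H": 4, "L": 5, "M": 6, "A": 7}


def enc_implied(base, args):
    # No operands: the opcode byte is the whole instruction.
    return [base]


def enc_mov(base, args):
    # Destination goes in bits 5-3, source in bits 2-0.
    return [base | (REGS[args[0]] << 3) | REGS[args[1]]]


def enc_mvi(base, args):
    # Register in bits 5-3, followed by one data byte.
    return [base | (REGS[args[0]] << 3), int(args[1], 16) & 0xFF]


def enc_addr(base, args):
    # Opcode followed by a 16-bit address, low byte first.
    addr = int(args[0], 16)
    return [base, addr & 0xFF, (addr >> 8) & 0xFF]


# Hypothetical table: mnemonic -> (base opcode, callback).
OPCODES = {
    "NOP": (0x00, enc_implied),
    "MOV": (0x40, enc_mov),
    "MVI": (0x06, enc_mvi),
    "JMP": (0xC3, enc_addr),
}


def assemble(mnemonic, args):
    base, handler = OPCODES[mnemonic]
    return handler(base, args)
```

For example, assemble("MOV", ["A", "B"]) gives the single byte 78H, and
assemble("JMP", ["1234"]) gives C3H 34H 12H.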

It also handles address counting.  It supports the use of the ORG pseudo-opcode
to set the address counter to a specific address to start assembly or to
continue assembling from.  And it tracks the count as opcodes of 1, 2 or 3 bytes
are processed.
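
The address counting can be pictured with a short Python sketch.  The SIZES
table and the addresses function are made up for illustration and cover only a
handful of mnemonics; the byte counts themselves are the real 8080 instruction
lengths.

```python
# Instruction lengths in bytes for a few 8080 mnemonics.
SIZES = {"NOP": 1, "MOV": 1, "MVI": 2, "LXI": 3, "JMP": 3, "CALL": 3}


def addresses(lines):
    """Return (address, mnemonic) pairs for (mnemonic, operand) input."""
    counter = 0
    placed = []
    for mnemonic, operand in lines:
        if mnemonic == "ORG":
            counter = operand   # move the counter; ORG itself emits nothing
            continue
        placed.append((counter, mnemonic))
        counter += SIZES[mnemonic]
    return placed
```

Feeding it ORG 100H, MVI, JMP, ORG 200H, NOP places the MVI at 100H, the JMP
at 102H, and the NOP at 200H.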

You can create labels which will save the current address to the symbol table to
be referenced by other instructions.  And, bonus, you can create a label on a
line by itself allowing for multiple labels for a single address.

You can reference undefined labels, as long as they get defined later.
Undefined label references are stored and at the end of assembly, the references
are resolved by searching for the labels in the symbol table.
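
A minimal Python sketch of that end-of-assembly step, assuming each stored
reference is a pair of (address to patch, label name).  The resolve function
and that pair layout are assumptions for the example, not how the assembler
actually stores them.

```python
def resolve(memory, symbols, fixups):
    """Patch every recorded 16-bit reference once all labels are known."""
    for patch_addr, name in fixups:
        if name not in symbols:
            raise KeyError("undefined label: " + name)
        target = symbols[name]
        memory[patch_addr] = target & 0xFF             # low byte first
        memory[patch_addr + 1] = (target >> 8) & 0xFF  # then high byte
```

A JMP to a not-yet-defined label would be written as C3H followed by two
placeholder bytes, with the address of the first placeholder saved as the
reference to patch.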

DW, DB, and DS pseudo-opcodes are implemented for data storage and can be
labeled and referenced by that label.  I even managed to support strings with DB
including some escaped, non-printable characters like tab and newline.
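
Here is a small Python sketch of turning a DB string into bytes.  Only the tab
and newline escapes mentioned above are included, and the backslash syntax and
the db_string name are assumptions for the example.

```python
# Assumed escape set: just the two mentioned in the text.
ESCAPES = {"t": 0x09, "n": 0x0A}


def db_string(text):
    """Convert the inside of a quoted string into a list of byte values."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            code = next(chars, None)   # the character after the backslash
            if code not in ESCAPES:
                raise ValueError("unknown escape")
            out.append(ESCAPES[code])
        else:
            out.append(ord(ch))
    return out
```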

Both the EQU and SET pseudo-opcodes are implemented.  EQU definitions cannot be
changed and will be an error if you try.  SET definitions can be changed and the
assembler uses whatever the last value was when you reference it.  Unlike
labels, both need to be defined before they are referenced.
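
The difference between the two can be sketched in Python.  The Definitions
class is hypothetical, and treating an EQU of an already used name as an error
is my assumption for the example.

```python
class Definitions:
    def __init__(self):
        self.values = {}
        self.constants = set()   # names that were created with EQU

    def define_equ(self, name, value):
        if name in self.values:
            raise ValueError(name + " is already defined")
        self.values[name] = value
        self.constants.add(name)

    def define_set(self, name, value):
        if name in self.constants:
            raise ValueError(name + " is an EQU and cannot change")
        self.values[name] = value   # later SETs simply replace the value

    def lookup(self, name):
        # No deferred resolution here, unlike labels.
        if name not in self.values:
            raise KeyError(name + " used before definition")
        return self.values[name]
```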

Numeric values can be entered as decimal, binary, hex, single word octal or 2
byte octal.  That last one is quirky, but important.  I've done most of my
manual addressing as 2 octal bytes because when you reference an address in a
CALL or JMP, the address is broken into 2 bytes and octal has a bad habit of
changing when represented as a single 16-bit word versus 2 8-bit bytes.  For
example, 123456Q as a 16-bit word becomes 247Q 056Q as 2 bytes.  It makes it
easier if I can be consistent with how I have been counting up until now.
Although, it's about time I switch to using hexadecimal anyway.
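
The conversion in that example can be checked with a few lines of Python.  The
function names are made up for the illustration.

```python
def word_to_split_octal(word):
    """Format a 16-bit word as two 3-digit octal bytes, high byte first."""
    high, low = (word >> 8) & 0xFF, word & 0xFF
    return "%03oQ %03oQ" % (high, low)


def split_octal_to_word(text):
    """Parse 'hhhQ lllQ' back into a single 16-bit word."""
    high, low = (int(part.rstrip("Q"), 8) for part in text.split())
    return (high << 8) | low
```

word_to_split_octal(0o123456) returns "247Q 056Q".  The digits change because
a 16-bit word splits into octal digits as 1+3+3+3+3+3 bits, while each 8-bit
byte splits as 2+3+3, so the digit boundaries don't line up.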

The code can be commented.  Anything after a ';' until the end of the line is
ignored.


# Missing Features #

I still didn't get all the bells and whistles in on this round.  I had to draw a
line somewhere and I had blown way past it already.

Formatting is very strict.  Optional label, one tab, opcode, one tab, comma
separated args.  The args can be followed by any garbage you want (see below).
You can't use multiple tabs or spaces and all entered characters are
automatically uppercased before being stored.
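
A rough Python sketch of a parser for that strict layout.  The parse_line
function is hypothetical, and it naively treats any ';' as the start of a
comment, even inside a string.

```python
def parse_line(line):
    """Split 'LABEL<tab>OPCODE<tab>ARG,ARG ; comment' into its fields."""
    code = line.split(";", 1)[0].rstrip().upper()   # drop comment, uppercase
    fields = code.split("\t")                       # exactly one tab per gap
    label = fields[0]
    opcode = fields[1] if len(fields) > 1 else ""
    args = fields[2].split(",") if len(fields) > 2 and fields[2] else []
    return label, opcode, args
```

parse_line("loop\tmvi\ta,5 ; load") gives ("LOOP", "MVI", ["A", "5"]).  With
two tabs before the opcode, the opcode field comes back empty and the mnemonic
lands in the argument list, which is the kind of input strict formatting
rejects.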

Robust parsing.  Once the parser sees all the fields it needs, it stops looking.
You could, by mistake, provide 2 args to an opcode that requires one and the
second one will be silently ignored.  The program might work the way you want
but the written code won't make sense.

Robust error reporting.  It will catch most formatting errors, referencing
undefined symbols, redefining an EQU, etc, but no detail is provided regarding
what line, or exactly where the error is.  Right now, the code is echoed as it
is being streamed in so you'll see where assembly stops.  Messages are terse to
save space.

IF/ENDIF pseudo-opcodes aren't implemented.  I haven't felt the need for them.
It may be a convenient way to comment out code, toggle debug code, or support
multiple hardware configurations but I can use ';' to comment out code or debug
instructions, and I only need to support my own hardware.

MACROs were a "no way".  I wasn't sure how to handle MACROs yet, especially
MACROs with parameters which would be cool.  I'm saving this one for later.

It does not support mathematical expressions.  Besides chars, number
conversions, and strings, you can't get fancy with argument values.  Basically,
I'd have to write a full 16-bit multi-function calculator.  This might be the
next thing I do so I can include it in the assembler but it's too much extra
work for my needs right now.

Separate from error reporting, it also doesn't do much error checking.  For
example, you can grow the symbol table down over your own program or the
assembler itself, if it's in memory below the symbol table.  You can ORG into
the symbol table or assembler and overwrite it that way.  You can ORG backwards
and overwrite your own program you're assembling.  All the memory is yours to
abuse; the assembler won't be looking out for you.  Burning the assembler to a
PROM will at least protect that from being wiped out.  I also don't check for
buffer overruns so you can probably cause wackiness by entering really long
labels or something.

This doesn't act like a real assembler.  You're expected to type each line of
assembly or, in the modern day, cheat and stream it from a terminal emulator.
There is no reading code from storage, or memory, no saving the symbol table,
and no saving the binary output, except directly into memory.  File IO is on the
short list of things to get to.  Also an editor to make a full development
environment.


# Known Bugs #

Besides the above shortcomings, there are also a few bugs I already know I'll
need to fix.  I haven't yet, mostly because adding code would require me to, by
hand, re-address over 2 thousand lines of code and update all the references.  I
should probably use this new assembler to do that for me.  I had been adding new
sub-routines near the end of the code.

When entering a line of code, typing a tab, which is the required field
separator, advances the cursor 4 spaces in your terminal, but backspacing or
deleting it only moves the cursor back one space.  I reused an old subroutine
and somehow I never caught that until recently.  It doesn't break data entry,
but it can be confusing if you try to delete more of the visual spaces and in
memory you're deleting actual characters which will still be visible in the
terminal.

You can't use special escaped characters in expressions, only within strings
(which are limited to the DB pseudo-opcode).  You might want to CPI '\0' to
check for a null terminating byte at the end of a string but you'll need to use
a numeric value instead.  Printable literal chars work, though.  That will be an
easy copy and paste from the string subroutine to implement.

All characters are uppercased automatically when entered.  I did this just to
simplify opcode and symbol lookup but neglected to think that you might want to
enter case sensitive strings as data.  Oops.

I got lazy and it doesn't check data length against an expected size.  If you
enter a hex word of FFFFH when a byte is needed, it will silently truncate to
the last byte, FFH.  If a word is needed and you enter less than that, it will
prefix it with as many zeros as it needs.  I would consider this a feature, but
it was worth mentioning the behavior.

Similarly, it doesn't check numeric input length so if you make a mistake and
use too many digits, you silently lose the most significant digits that didn't
fit.  Too many bits in that 16-bit binary word and you might not notice you lost
the most significant bit.  I also imagine it's going to be really easy to
mistakenly enter a 000 377Q as a 2 byte octal word and end up with 000277Q in
memory because I parse it as 2 separate bytes and truncate each one separately.
I know how to fix this, but didn't go back and implement it that way.  I should
really be checking syntax better overall, anyway.
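
To show the per-byte truncation in isolation, here is a Python sketch with a
deliberately out-of-range low byte.  Both functions are hypothetical, and the
checked version is just one possible guard, not necessarily the fix I have in
mind.

```python
def parse_split_octal(high_text, low_text):
    """Parse two octal bytes, silently keeping only 8 bits of each."""
    high = int(high_text, 8) & 0xFF
    low = int(low_text, 8) & 0xFF
    return (high << 8) | low


def parse_split_octal_checked(high_text, low_text):
    """Same parse, but reject any byte larger than 377Q."""
    high, low = int(high_text, 8), int(low_text, 8)
    if high > 0xFF or low > 0xFF:
        raise ValueError("octal byte out of range (max 377Q)")
    return (high << 8) | low
```

With the unchecked version, entering 000 677Q quietly stores 000 277Q because
the ninth bit of 677Q is dropped.  The checked version raises an error
instead.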


# Up Next #

The next thing I need to do, besides more testing, is to reassemble into higher
memory so you can write programs using interrupts and be able to start at
address zero and have as much memory real estate as possible.  This will require
me to rewrite the assembler code in this assembler's assembly syntax.

The big features I want to add are simple calculations so you can reference
memory offsets and such, and MACROs to better organize and reuse code.  I think.
I'm not sure I'm sold on the necessity of MACROs, yet.

I'd like to revisit a number of design decisions, try other algorithms, clean up
the code, optimize for code reuse and make it more user friendly.

Alternatively, I could just claim victory here, say I can write an assembler
from scratch by hand, know that I could add more to it if I wanted to, and
instead just use one off the shelf from the era that is more compact and
featureful.

It's going to depend on my mood, I guess.  And if I can find an off the shelf
assembler I like.


[0] http://cowlark.com/2019-06-01-cpm-asm/
[1] https://www.youtube.com/c/8BitShowAndTell/
[2] gopher://kagu-tsuchi.com:70/0/blog/articles/altair_assembler_part_1.html