Accelerating Pygmy Forth's editor with direct video writes
==========================================================

I like to use Pygmy Forth on my HP 200LX.  It is a complete development
environment, containing interpreter, assembler and editor in about
18 KB.  That makes it about as compact as the original Turbo Pascal.

One nice thing about this Forth is that it includes full source code.
The source dwarfs the small executable (the source is 200K and the
comments another 200K)... but saving room for the source is well worth
it, because the system is self-hosting.  You can change the source,
then invoke the metacompiler so the system spits out a near-clone of
itself that incorporates your desired changes :D

Something that bothers me about this Forth is the slow speed of its
editor on the 200LX.  This is a traditional "screen editor" that
displays a 64x16 character (1024 byte) block.  Carriage returns and
control characters are not interpreted; instead, each line is padded
with spaces to 64 characters, with a hard wrap after the 64th
character.  This is a Forth tradition that makes the editor simpler
to write at the cost of wasted whitespace in screen files.

Despite the editor's simplicity, it's very slow.  Drawing a 64x16
block takes about a full second, as lines paint one-by-one from the
top to the bottom of the screen.  This isn't the hardware's fault --
the 200LX has a nice snappy text editor in ROM, and Turbo Pascal
(and even XVI) don't have this problem.  So, the problem must be
something in Pygmy Forth.  Let's dive into its source and see if we
can find out what's going on.

Screen 103 contains the core of the editor in Pygmy 1.7:
----------------------------------------------------------------
( Editor        )                                               |
| VARIABLE INS ( insert or overwrite flag)                      |
| VARIABLE XIN   | VARIABLE #CUTS                               |
| : .H ( -) CUR@  0 0 AT ." scr # " SCR @ DUP .   >UNIT# .FILE  |
    ."   find(3,1) rep(4,2) del(5) join(6) cut(7,8) "           |
    INS @ IF ." i c=" ELSE  ."   c=" THEN #CUTS ?   AT ;        |
| : L1 ( -)  SCR @ BLOCK EBUF !     .H  ;                       |
                                                                |
| : L2 ( -) CUR@ 1 0 AT   EBUF @  64 FOR  45 EMIT NEXT CR       |
    16 FOR  64 FOR C@+ EMIT  NEXT ." |" CR NEXT DROP (  )       |
    64 FOR  45 EMIT NEXT  AT  ;                                 |
                                                                |
| : L ( -)  L1  L2  ;                                           |
                                                                |
                                                                |
                                                                |
----------------------------------------------------------------

Here, .H prints the header row at the top of the screen, (0, 0).
L1 fetches the address of the current screen block then calls .H to
print the header.  L2 does the real work of printing the 64x16 text
area:

    64 FOR  45 EMIT NEXT     -- print the top border, a row of -'s.
    CR                       -- moves to the next line.
    16 FOR ... NEXT DROP     -- this loop prints the 16 rows.
        64 FOR ... NEXT      -- this loop prints the 64 columns.
            C@+ EMIT         -- prints the next character in the block.
        ." |"                -- prints the row's right border, a |.
        CR                   -- moves to the next line
    64 FOR  45 EMIT NEXT     -- prints the bottom border, a row of -'s.

There is also some cursor handling code scattered around L2.  The word
CUR@ ( -- row col) returns the current cursor position.  The word AT
( row col --) moves the cursor to the specified position.

    CUR@ ... AT             -- at the start of L2, save original cursor
                               position, then restore it at the end.
    1 0 AT                  -- start drawing at row 1, the second screen
                               line, just below the header.

While L2 is running, each EMIT, ." and CR moves the cursor.  This
technique is similar to the old familiar gotoXY() and write() approach
often used to draw text interfaces in Pascal and other languages.

In this Forth, EMIT is a vectored word so the Forth system can be
more easily retargeted to different computers.

    SEE EMIT
    (EMIT
    ok

The vector currently points at the word (EMIT, which is defined with
the other BIOS video words:
----------------------------------------------------------------
( BIOS Int $10 video functions )                                |
                                                                |
CODE (AT  ( row col -) BL DL MOV, BX POP, BL DH MOV,            |
 BX BX SUB, $0200 #, AX MOV,  $10 #, INT, BX POP, NXT, END-CODE |
                                                                |
CODE (CUR@ ( - row col)                                         |
  BX PUSH,  BX BX SUB,  $0300 #, AX MOV,  $10 #, INT, BX BX SUB,|
  DL BL MOV, DL DL SUB, DH DL XCHG, DX PUSH, NXT, END-CODE      |
                                                                |
CODE (EMIT ( c -)  BX AX MOV, $0E #, AH MOV, BH BL MOV,         |
 $10 #, INT,  BX POP,  NXT,   END-CODE                          |
                                                                |
                                                                |
                                                                |
                                                                |
                                                                |
----------------------------------------------------------------

(EMIT uses BIOS INT 10h function 0Eh (teletype output) for maximum
compatibility, at the cost of speed.  Maybe this is why the editor is
so slow?  Indeed, Pygmy Forth comes with an option (see screens
197-200) that reconfigures EMIT and other vectored words to use direct
video memory output, then outputs the modified Forth system as
DFORTH.COM so you can tell it apart from the original one :D.

Is this faster?  Yes, but unfortunately, not very much.  There is a
noticeable speedup on the HP 200LX, but we're still stuck watching
the screen paint line-by-line.  It's much slower than other editors.

Can we do better?  Of course!  All the source code is right here.

It turns out that a big reason for the slowness is all of the useless
cursor movement.  EMIT is great if you want scrolling teletype-style
output, but the block editor doesn't work like that:  all it does
is print a 64x16 block, which definitely fits on our 80x25 screen.
It should display a block as quickly as possible at a fixed screen
location.  It doesn't need a cursor, or scrolling.  EMIT is doing a
lot of extra work that we don't need!

(If you're curious, the source for the direct-video EMIT is on screens
198-199.  The (DEMIT word alone is two screens long, and the editor
calls it 64x16 (1024) times, once per character, so it's easy to see
why this might be slow.)
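
As a rough count, assuming every character and border hyphen in the
original L2 goes through EMIT exactly once, the numbers can be checked
at the Forth prompt.  This is only a back-of-the-envelope sketch using
plain arithmetic words:

```forth
( Rough count of explicit EMIT calls in one pass of the original L2 )
64 16 * .             ( 1024 text characters )
64 16 * 64 + 64 + .   ( 1152 once the two 64-hyphen borders are added )
```

The 16 ." |" strings and the CRs come on top of that, so a single
redraw means well over a thousand trips through the output routine.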

To eliminate all the extra work, we will write an optimized replacement
for the editor's "L2" word.

First, edit screen 103 to make a vector word for "L2" so we can
override it later.  This is not strictly necessary, but it lets us
switch between the "slow" and "optimized" version of the editor at
will, which is handy if we break the editor while we are in the middle
of using it to edit itself :D
----------------------------------------------------------------
( Editor        )                                               |
| VARIABLE INS ( insert or overwrite flag)                      |
| VARIABLE XIN   | VARIABLE #CUTS                               |
| : .H ( -) CUR@  0 0 AT ." scr # " SCR @ DUP .   >UNIT# .FILE  |
    ."   find(3,1) rep(4,2) del(5) join(6) cut(7,8) "           |
    INS @ IF ." i c=" ELSE  ."   c=" THEN #CUTS ?   AT ;        |
| : L1 ( -)  SCR @ BLOCK EBUF !     .H  ;                       |
                                                                |
DEFER ED-L2  ( routine to draw the 64x16 area + borders)        |
| : L3 ( a-) 16 FOR 64 FOR C@+ EMIT NEXT ." |" CR NEXT DROP ;   |
| : L2 ( -) CUR@ 1 0 AT   EBUF @  64 FOR  45 EMIT NEXT CR       |
            L3   18 0 AT          64 FOR  45 EMIT NEXT  AT  ;   |
                                                                |
| : L ( -)  L1  ED-L2 ;                                         |
' L2 IS ED-L2                                                   |
                                                                |
----------------------------------------------------------------

ED-L2 is the vectored word we can re-point later.

(You'll also see I factored L2's 64x16 drawing code into a separate
word L3, but this change is unnecessary because L3 won't be used at
all once we're finished overriding L2 :D)
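
For readers who haven't used vectored words before, here is a minimal
sketch of the mechanism.  It assumes DEFER, ' and IS behave as they do
on the modified screen 103 above.  SHOW, SHOW-A and SHOW-B are made-up
names for illustration only:

```forth
DEFER SHOW                     ( a vectored word with no behavior yet )
: SHOW-A ( -)  ." slow " ;     ( hypothetical stand-in for L2 )
: SHOW-B ( -)  ." fast " ;     ( hypothetical stand-in for a replacement )

' SHOW-A IS SHOW   SHOW        ( runs SHOW-A )
' SHOW-B IS SHOW   SHOW        ( runs SHOW-B )
```

Anything compiled against SHOW never needs recompiling; only the
vector changes.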

Here is the replacement for L2:
----------------------------------------------------------------
( Editor - fast direct-video display without cursor movement)   |
CODE (FAST-L2) ( afrom ato -- )   CLD,                          |
  $B800 #, DX MOV,   DX ES MOV, ( video segment)                |
  SI DX MOV,                    ( save reg.)                    |
  SI POP,  BX DI MOV,           ( arguments )                   |
  64 #, CX MOV, $072D #, AX MOV,   REP, AX STOS,  ( ------)     |
  16 #, CX MOV,   $20 #, AL MOV,   REP, AX STOS,  ( spaces)     |
  16 #, CX MOV,   BEGIN,   CX BX MOV,                           |
    64  #, CX MOV,   BEGIN,   AL LODS,   AX STOS,   LOOP,       |
                   $7C #, AL MOV,        AX STOS,   (      |)   |
    15  #, CX MOV, $20 #, AL MOV,   REP, AX STOS,   ( spaces)   |
    BX CX MOV,   LOOP,                                          |
  64 #, CX MOV,   $2D #, AL MOV,   REP, AX STOS,  ( ------)     |
  16 #, CX MOV,   $20 #, AL MOV,   REP, AX STOS,  ( spaces)     |
  DX SI MOV,   ( restore reg.)  BX POP,   ( fix up TOS)         |
NXT, END-CODE                                                   |
----------------------------------------------------------------

This code works by copying characters from the code BLOCK location
directly into video memory.  16-bit x86 processors use segmented memory
to address more than 64 KB.  CGA+ text mode video memory is at segment
$B800, which is not the segment where the code BLOCK lives, so the
copy must span segments.  Inline assembly is the easiest way of
accessing another segment in Pygmy Forth, and is best for speed
anyhow, because we can use the x86 string instructions to copy memory
with minimum overhead.

The x86 string instructions by default read from DS:SI and write to
ES:DI.  So by setting the ES register we'll be able to copy into
video memory.

The instructions:
    $B800 #, DX MOV,    -- in Intel syntax would be MOV DX,B800h
    DX ES MOV,          -- in Intel syntax would be MOV ES,DX

set the ES segment register to $B800.  Note that x86 cannot load an
immediate value straight into a segment register; instead we have to
load a general-purpose register and then transfer its value to ES.

Pygmy Forth's calling convention requires registers BX, SI, SP, BP,
CS, DS, and SS to be preserved in assembly CODE functions.  The x86
string instructions use registers SI and DI, so we need to save SI
and restore it at the end of the function.

    SI DX MOV,  ...  DX SI MOV,   -- save SI in DX, restore it at the end

The (FAST-L2) word takes two arguments: the source address and
destination address.  Pygmy Forth stores the top-of-stack item in
BX, and stores remaining items on the x86 hardware stack.  We want
the source address in SI and the destination address in DI.  Thus,
the following code:

    SI POP,       -- 2nd stack item --> SI
    BX DI MOV,    -- top-of-stack   --> DI
    ...
    BX POP,       -- the (FAST-L2) word consumes both its arguments,
                     so pop a second time from the hardware stack
                     (into BX) to consume the second argument and
                     restore TOS.

                     Waiting until the end of the word to restore
                     BX also means we can use BX as a scratch
                     register in the meantime, which we'll do :D

The rest of the word consists of calls to the x86 string instructions.
However, we cannot copy ASCII text directly into the video buffer and
expect it to work.  On the CGA+, each ASCII character is followed by an
"attribute" byte that determines its foreground and background color
and other attributes, such as whether it should blink.

            - attribute 07:  0 = black background, 7 = white foreground
           /       /       /
     'A'  /  'B'  /  'C'  / 
     /   /   /   /   /   /
    41  07  42  07  43  07
    AL  AH
    | AX |

Fortunately, 16-bit x86 string instructions let us write to memory two
bytes at a time via the AX register.  Because x86 is little-endian,
AL (the lower half of the register) is stored in memory first and
holds the ASCII character.  AH holds the attribute byte.  You are now
prepared to understand the following loop:
    64 #, CX MOV,     -- in Intel syntax would be MOV CX, 64
    $072D #, AX MOV,  -- in Intel syntax would be MOV AX,072Dh
    REP, AX STOS,     -- in Intel syntax would be REP STOSW

Setting CX to 64 means the store repeats 64 times.  Setting AX to
$072D is equivalent to setting AH=$07 and AL=$2D.  $07 is the attribute
code for white-on-black text, and $2D is the ASCII code for hyphen.
Finally, REP STOSW does the fill, writing AX to ES:DI and advancing DI
by two bytes on each repetition.  So this code displays a line of 64
hyphens, which is the top border of the editor window.
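
The packing of character and attribute into one 16-bit value can be
checked at the prompt.  This is only a sketch: VCELL is a hypothetical
helper, not a Pygmy word, and the printed values assume the number
base is decimal:

```forth
( VCELL: hypothetical helper, combines a character and an attribute )
: VCELL ( char attr - cell)  256 * + ;

$2D $07 VCELL .     ( 1837, i.e. $072D: hyphen, white on black )
$7C $07 VCELL .     ( 1916, i.e. $077C: the | border character )
```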

A similar call displays 16 spaces after the row of 64 hyphens.
Note that we only have to change AL this time, because AH is still
set to attribute $07:
    16 #, CX MOV,   $20 #, AL MOV,   REP, AX STOS,  ( spaces)

Our screen width is 80 columns, and so far we've written 80
character cells, so by now DI points at the start of the next line
of video memory and we are ready to print the next line without
further ado.  No cursor is necessary.
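
The row arithmetic is easy to verify interactively.  ROW>OFF below is
a hypothetical helper, assuming an 80-column text mode with two bytes
per cell and drawing that starts on row 1 (as the original L2 does with
1 0 AT):

```forth
( ROW>OFF: hypothetical helper, byte offset of a row in 80-column text )
: ROW>OFF ( row - offset)  80 * 2 * ;

64 16 + .       ( 80 cells written so far: exactly one row )
1 ROW>OFF .     ( 160: where row 1, the top border, starts )
2 ROW>OFF .     ( 320: where DI points after the border and padding )
```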

Borders aside, the remaining block of code is the meat of (FAST-L2):
it displays the 64x16 screen and the right-hand border.
I'll add line numbers for ease of reference.

1:  16 #, CX MOV,   BEGIN,   CX BX MOV,
2:    64  #, CX MOV,   BEGIN,   AL LODS,   AX STOS,   LOOP,
3:                   $7C #, AL MOV,        AX STOS,
4:    15  #, CX MOV, $20 #, AL MOV,   REP, AX STOS,
5:    BX CX MOV,   LOOP,

Lines 1 and 5 are essentially a 16 FOR ... NEXT loop.  Line 2 is
essentially a 64 FOR ... NEXT loop.  This is how the 64x16 screen
gets drawn.  The x86 LOOP instruction uses CX as the loop counter,
so the outer count has to be saved in BX at the top of the outer loop
and restored at the bottom (lines 1 and 5).

On Line 2, the pattern AL LODS,  AX STOS, loads one byte from the
code BLOCK at DS:SI, stores it to AL, then writes AX to video memory
at ES:DI.  This is how each lone ASCII byte in the code BLOCK gets
transformed to two bytes (character, attribute) in video RAM.

Line 3 prints the "|" right-hand border at the end of the 64-character
line.  Because of this, Line 4 only has to print 15 (not 16) spaces
to carry us to the start of the next line of video memory.
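
Adding up the pieces gives a quick sanity check on the whole frame.
This sketch assumes the same 80-column, two-bytes-per-cell layout, a
25-row screen, and drawing that starts at row 1:

```forth
64 1 + 15 + .     ( 80: text, border and padding fill one row )
1 16 + 1 + .      ( 18 rows: top border, 16 text rows, bottom border )
18 160 * .        ( 2880 bytes written per redraw )
160 2880 + .      ( 3040: end offset when drawing starts at row 1 )
25 160 * .        ( 4000 bytes in an 80x25 page, so the frame fits )
```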

That's the optimized routine!

At this point, all we need is a bit of glue code to get its signature
to match the editor's original L2.  And then we can install it by
re-vectoring ED-L2.
----------------------------------------------------------------
( Editor - install fast display hook)                           |
                                                                |
: FAST-L2 ( -)                                                  |
    EBUF @   160   (FAST-L2) ;                                  |
                                                                |
103 114 THRU ( reload editor)                             ( XX) |
: FAST-EDITOR   ['] FAST-L2 IS ED-L2 ;                          |
: SLOW-EDITOR   ['] L2      IS ED-L2 ;                    ( XX) |
                                                                |
FAST-EDITOR                                                     |
                                                                |
( Note: This is the installation block for testing the editor.) |
( If you want to permanently metacompile it in, delete lines)   |
( marked XX first.)                                             |
                                                                |
                                                                |
----------------------------------------------------------------
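
With that screen loaded, switching between the two display routines is
a one-word affair.  A usage sketch, assuming the test version of the
screen (the one that still defines SLOW-EDITOR) was loaded:

```forth
SLOW-EDITOR   ( ED-L2 now runs the original EMIT-based L2 )
1 EDIT        ( redraws line by line; leave the editor when done )
FAST-EDITOR   ( ED-L2 now runs FAST-L2 )
1 EDIT        ( redraws via direct video writes )
```

The 160 passed to (FAST-L2) is presumably the byte offset of row 1 in
an 80-column text page (80 cells x 2 bytes), which matches the 1 0 AT
in the original L2.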

How much does this help?

Benchmark:
1 EDIT (to load the editor on screen 1), page-down repeatedly until
reaching screen 10, then change the first character to 'X':

    PYGMY.COM:  19.5 seconds.
    DPYGMY.COM: 13.5 seconds.
    My editor:   2.0 seconds.

Wow!  In practice, this feels "instantaneous", same as any other
editor on the HP 200LX -- and same as Pygmy's typical performance on
my 486 and in DOSBox.  Mission accomplished :D