Accelerating Pygmy Forth's editor with direct video writes
==========================================================

I like to use Pygmy Forth on my HP 200LX. It is a complete development
environment, containing interpreter, assembler and editor in about 18 KB.
Its space-compactness is on par with the original Turbo Pascal.

One nice thing about this Forth is that it includes full source code.
This eclipses the small size of the executable (as the source is 200K and
the comments another 200K)... but saving room for the source is well
worth it, because the system is self-hosting. You can change the source,
then invoke the metacompiler so the system spits out a near-clone of
itself that incorporates your desired changes :D

Something that bothers me about this Forth is the slow speed of its
editor on the 200LX. This is a traditional "screen editor" that displays
a 64x16 character (1024 byte) block. Carriage returns and control
characters are not interpreted; instead, each line is whitespace-padded
to 64 characters, with a hard-coded word wrap after the 64th character.
This is a Forth tradition that makes the editor simpler to write at the
cost of wasted whitespace in screen files.

Despite the editor's simplicity, it's very slow. Drawing a 64x16 block
takes about a full second, as lines paint one-by-one from the top to the
bottom of the screen. This isn't the hardware's fault -- the 200LX has a
nice snappy text editor in ROM, and Turbo Pascal (and even XVI) don't
have this problem. So, the problem must be something in Pygmy Forth.
Let's dive into its source and see if we can find out what's going on.

Screen 103 contains the core of the editor in Pygmy 1.7:

----------------------------------------------------------------
( Editor )                                                      |
                                                                |
VARIABLE INS ( insert or overwrite flag)                        |
                                                                |
VARIABLE XIN                                                    |
VARIABLE #CUTS                                                  |
                                                                |
: .H ( -) CUR@ 0 0 AT ." scr # " SCR @ DUP . >UNIT# .FILE       |
." find(3,1) rep(4,2) del(5) join(6) cut(7,8) "                 |
INS @ IF ." i c=" ELSE ." c=" THEN #CUTS ? AT ;                 |
                                                                |
: L1 ( -) SCR @ BLOCK EBUF ! .H ;                               |
                                                                |
                                                                |
: L2 ( -) CUR@ 1 0 AT EBUF @ 64 FOR 45 EMIT NEXT CR             |
16 FOR 64 FOR C@+ EMIT NEXT ." |" CR NEXT DROP ( )              |
64 FOR 45 EMIT NEXT AT ;                                        |
                                                                |
                                                                |
: L ( -) L1 L2 ;                                                |
                                                                |
                                                                |
                                                                |
----------------------------------------------------------------

Here, .H prints the header row at the top of the screen, (0, 0). L1
fetches the address of the current screen block then calls .H to print
the header. L2 does the real work of printing the 64x16 text area:

  64 FOR 45 EMIT NEXT    -- prints the top border, a row of -'s.
  CR                     -- moves to the next line.
  16 FOR ... NEXT DROP   -- this loop prints the 16 rows.
    64 FOR ... NEXT      -- this loop prints the 64 columns.
      C@+ EMIT           -- prints the next character in the block.
    ." |"                -- prints the row's right border, a |.
    CR                   -- moves to the next line.
  64 FOR 45 EMIT NEXT    -- prints the bottom border, a row of -'s.

There is also some cursor handling code scattered around L2. The word
CUR@ ( -- row col) returns the current cursor position. The word
AT ( row col --) moves the cursor to the specified position.

  CUR@ ... AT   -- at the start of L2, save the original cursor
                   position, then restore it at the end.
  1 0 AT        -- start the 64x16 area on line 2.

While L2 is running, each EMIT, ." and CR moves the cursor. This
technique is similar to the old familiar gotoXY() and write() approach
often used to draw text interfaces in Pascal and other languages.

In this Forth, EMIT is a vectored word so the Forth system can be more
easily retargeted to different computers.

  SEE EMIT (EMIT ok

The vector currently points at the word (EMIT.

----------------------------------------------------------------
( BIOS Int $10 video functions )                                |
                                                                |
CODE (AT ( row col -) BL DL MOV, BX POP, BL DH MOV,             |
BX BX SUB, $0200 #, AX MOV, $10 #, INT, BX POP, NXT, END-CODE   |
                                                                |
CODE (CUR@ ( - row col)                                         |
BX PUSH, BX BX SUB, $0300 #, AX MOV, $10 #, INT, BX BX SUB,     |
DL BL MOV, DL DL SUB, DH DL XCHG, DX PUSH, NXT, END-CODE        |
                                                                |
CODE (EMIT ( c -) BX AX MOV, $0E #, AH MOV, BH BL MOV,          |
$10 #, INT, BX POP, NXT, END-CODE                               |
                                                                |
                                                                |
                                                                |
                                                                |
                                                                |
----------------------------------------------------------------

(EMIT is using a BIOS INT 10h video function for maximum compatibility,
at the cost of speed. Maybe this is why the editor is so slow?

Indeed, Pygmy Forth comes with an option (see screens 197-200) that
reconfigures EMIT and other vector words to use direct memory video
output, then outputs the modified Forth system as DFORTH.COM so you can
tell it apart from the original one :D.

Is this faster? Yes, but unfortunately, not very much. There is a
noticeable speedup on the HP 200LX, but we're still stuck watching the
screen paint line-by-line. It's much slower than other editors.

Can we do better? Of course! All the source code is right here.

It turns out that a big reason for the slowness is all of the useless
cursor movement. EMIT is great if you want scrolling teletype-style
output, but the block editor doesn't work like that: all it does is
print a 64x16 block, which definitely fits on our 80x25 screen. It
should display a block as quickly as possible at a fixed screen
location. It doesn't need a cursor, or scrolling. EMIT is doing a lot of
extra work that we don't need!

(I'll let you look at the source code for direct video EMIT yourself, if
you want to (see screens 198-199). The (DEMIT word itself is two screens
long, and then the editor calls it 64x16 (1024) times to display each
character, so it's easy to see why this might be slow.)

To eliminate all the extra work, we will write an optimized replacement
for the editor's "L2" word.

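Before writing it, it helps to pin down what "a fixed screen location"
means in terms of video memory. The following is only a minimal sketch,
assuming the standard 80-column color text mode in which every screen
cell occupies two bytes (character first, then attribute). CELL-OFFSET
is a hypothetical helper name, not a word from Pygmy's source:

```forth
( Sketch: byte offset of a text cell inside the video segment. )
( Assumes 80 columns and 2 bytes per cell. CELL-OFFSET is a made-up name. )
: CELL-OFFSET ( row col - offset)  SWAP 80 * + 2* ;

1 0 CELL-OFFSET .   ( prints 160, the start of the second screen row )
```

The value 160 is the destination offset that the glue word passes to
(FAST-L2) further down, and it corresponds to the 1 0 AT used by the
original L2.
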
First, edit screen 103 to make a vector word for "L2" so we can override
it later. This is not strictly necessary, but it lets us switch between
the "slow" and "optimized" version of the editor at will, which is handy
if we break the editor while we are in the middle of using it to edit
itself :D

----------------------------------------------------------------
( Editor )                                                      |
                                                                |
VARIABLE INS ( insert or overwrite flag)                        |
                                                                |
VARIABLE XIN                                                    |
VARIABLE #CUTS                                                  |
                                                                |
: .H ( -) CUR@ 0 0 AT ." scr # " SCR @ DUP . >UNIT# .FILE       |
." find(3,1) rep(4,2) del(5) join(6) cut(7,8) "                 |
INS @ IF ." i c=" ELSE ." c=" THEN #CUTS ? AT ;                 |
                                                                |
: L1 ( -) SCR @ BLOCK EBUF ! .H ;                               |
                                                                |
DEFER ED-L2 ( routine to draw the 64x16 area + borders)         |
                                                                |
: L3 ( a-) 16 FOR 64 FOR C@+ EMIT NEXT ." |" CR NEXT DROP ;     |
                                                                |
: L2 ( -) CUR@ 1 0 AT EBUF @ 64 FOR 45 EMIT NEXT CR             |
L3 18 0 AT 64 FOR 45 EMIT NEXT AT ;                             |
                                                                |
                                                                |
: L ( -) L1 ED-L2 ;                                             |
' L2 IS ED-L2                                                   |
                                                                |
----------------------------------------------------------------

ED-L2 is the vector word we can modify later. (You'll also see I
factored L2's 64x16 drawing code into a separate word L3, but this
change is unnecessary because L3 won't be used at all once we're
finished overriding L2 :D)

Here is the replacement for L2:

----------------------------------------------------------------
( Editor - fast direct-video display without cursor movement)   |
CODE (FAST-L2) ( afrom ato -- ) CLD,                            |
$B800 #, DX MOV, DX ES MOV, ( video segment)                    |
SI DX MOV, ( save reg.)                                         |
SI POP, BX DI MOV, ( arguments )                                |
64 #, CX MOV, $072D #, AX MOV, REP, AX STOS, ( ------)          |
16 #, CX MOV, $20 #, AL MOV, REP, AX STOS, ( spaces)            |
16 #, CX MOV, BEGIN, CX BX MOV,                                 |
64 #, CX MOV, BEGIN, AL LODS, AX STOS, LOOP,                    |
$7C #, AL MOV, AX STOS, ( |)                                    |
15 #, CX MOV, $20 #, AL MOV, REP, AX STOS, ( spaces)            |
BX CX MOV, LOOP,                                                |
64 #, CX MOV, $2D #, AL MOV, REP, AX STOS, ( ------)            |
16 #, CX MOV, $20 #, AL MOV, REP, AX STOS, ( spaces)            |
DX SI MOV, ( restore reg.) BX POP, ( fix up TOS)                |
NXT, END-CODE                                                   |
----------------------------------------------------------------

This code works by copying characters from the code BLOCK location
directly into video memory. 16-bit x86 processors use segmented memory
to reach past the 64 KB limit of a 16-bit address. CGA+ text mode video
memory is at segment $B800, which is not the same segment where the code
BLOCK lives -- so the copy must span segments. Inline assembly is the
easiest way of accessing an alternate segment in Pygmy Forth, and is
best for speed anyhow, because we can use the x86 string instructions to
copy memory with minimum overhead.

The x86 string instructions by default copy from segment DS to segment
ES. So by setting the ES register we'll be able to copy into video
memory. The instructions:

  $B800 #, DX MOV,   -- in Intel syntax would be MOV DX,B800h
  DX ES MOV,         -- in Intel syntax would be MOV ES,DX

set the ES segment register to $B800. Note we can't load a constant into
a segment register directly; instead we have to set a different register
and then transfer its value to ES.

Pygmy Forth's calling convention requires registers BX, SI, SP, BP, CS,
DS, and SS to be preserved in assembly CODE functions. The x86 string
instructions use registers SI and DI, so we need to save SI and restore
it at the end of the function.

  SI DX MOV,   -- save register
  ...
  DX SI MOV,   -- restore register

The (FAST-L2) word takes two arguments: the source address and
destination address. Pygmy Forth stores the top-of-stack item in BX, and
stores remaining items on the x86 hardware stack. We want the source
address in SI and the destination address in DI. Thus, the following
code:

  SI POP,      -- 2nd stack item --> SI
  BX DI MOV,   -- top-of-stack --> DI
  ...
  BX POP,      -- the (FAST-L2) word consumes both its arguments, so
                  pop a second time from the hardware stack (into BX)
                  to consume the second argument and restore TOS.

Waiting until the end of the word to restore BX also means we can use BX
as a scratch register in the meantime, which we'll do :D

The rest of the word consists of calls to the x86 string instructions.
However, we cannot copy ASCII text directly into the video buffer and
expect it to work. On the CGA+, each ASCII character is followed by an
"attribute" byte that determines its foreground and background color and
other attributes, such as whether it should blink.

  bytes in video RAM:   41   07   42   07   43   07
  meaning:              'A'  attr 'B'  attr 'C'  attr
  register half:        AL   AH
                        | AX  |

  attribute 07: 0 = black background, 7 = white foreground

Fortunately, 16-bit x86 string instructions let us write to memory two
bytes at a time via the AX register. Because x86 is little-endian, AL
(the lower half of the register) is stored in memory first and holds the
ASCII character. AH holds the attribute byte.

You are now prepared to understand the following loop:

  64 #, CX MOV,      -- in Intel syntax would be MOV CX,64
  $072D #, AX MOV,   -- in Intel syntax would be MOV AX,072Dh
  REP, AX STOS,      -- in Intel syntax would be REP STOSW

Setting CX to 64 means we'll loop 64 times. Setting AX to $072D is
equivalent to setting AH=$07 and AL=$2D. $07 is the attribute code for
white-on-black text, and $2D is the ASCII code for hyphen. Finally, REP
STOSW kicks off the fill, writing AX to ES:DI and advancing DI by two
bytes on each pass. So this code displays a line of 64 hyphens, which is
the top border of the editor window.

A similar call displays 16 spaces after the row of 64 hyphens. Note that
we only have to change AL this time, because AH is still set to
attribute $07:

  16 #, CX MOV, $20 #, AL MOV, REP, AX STOS, ( spaces)

Our screen width is 80 columns, and so far we've displayed 80
characters, so by now DI has wrapped to the start of the next line of
video memory and we are ready to print the next line without further
ado. No cursor is necessary.

Borders aside, the remaining, innermost block of code is the meat of L2
that displays the 64x16 screen and the right-hand border.
I'll add line numbers for ease of reference.

  1: 16 #, CX MOV, BEGIN, CX BX MOV,
  2:   64 #, CX MOV, BEGIN, AL LODS, AX STOS, LOOP,
  3:   $7C #, AL MOV, AX STOS,
  4:   15 #, CX MOV, $20 #, AL MOV, REP, AX STOS,
  5: BX CX MOV, LOOP,

Lines 1 and 5 are essentially a 16 FOR ... NEXT loop. Line 2 is
essentially a 64 FOR ... NEXT loop. This is how the 64x16 screen gets
drawn. The x86 LOOP instruction uses CX as the loop counter, and the
inner loop and the REP on line 4 reuse CX as well, so the outer count
has to be saved (in BX) at the start and restored at the end of the
outer loop (lines 1 and 5).

On Line 2, the pattern AL LODS, AX STOS, loads one byte from the code
BLOCK at DS:SI, stores it to AL, then writes AX to video memory at
ES:DI. This is how each lone ASCII byte in the code BLOCK gets
transformed to two bytes (character, attribute) in video RAM.

Line 3 prints the "|" right-hand border at the end of the 64-character
line. Because of this, Line 4 only has to print 15 (not 16) spaces to
carry us to the start of the next line of video memory.

That's the optimized routine! At this point, all we need is a bit of
glue code to get its signature to match the editor's original L2. And
then we can install it by re-vectoring ED-L2.

----------------------------------------------------------------
( Editor - install fast display hook)                           |
                                                                |
: FAST-L2 ( -)                                                  |
EBUF @ 160 (FAST-L2) ;                                          |
                                                                |
103 114 THRU ( reload editor) ( XX)                             |
: FAST-EDITOR ['] FAST-L2 IS ED-L2 ;                            |
: SLOW-EDITOR ['] L2 IS ED-L2 ; ( XX)                           |
                                                                |
FAST-EDITOR                                                     |
                                                                |
( Note: This is the installation block for testing the editor.) |
( If you want to permanently metacompile it in, delete lines)   |
( marked XX first.)                                             |
                                                                |
                                                                |
----------------------------------------------------------------

How much does this help?

Benchmark: 1 EDIT (to load the editor on screen 1), page-down repeatedly
until reaching screen 10, then change the first character to 'X':

  PYGMY.COM:  19.5 seconds.
  DPYGMY.COM: 13.5 seconds.
  My editor:   2.0 seconds.

Wow!
In practice, this feels "instantaneous", same as any other editor on the HP 200LX -- and same as Pygmy's typical performance on my 486 and in DOSBox. Mission accomplished :D