Just a devlog - Forth innards exposed

Forth innards exposed

Tags:
language
forth

I continued playing with Forth. TLDR: Here the source and here the live demo. Here on the description of what I have done in order to make it Turing complete. In hindsight I should had divided the work better, in more regular chunks, but now it’s done.

Last on previous post, we got to define new words using existing words by the compile(), but we can not define new ones within our interpreter yet. One step to achieve that is to have means of switching between compile and interpreter mode. To understand this let’s first see a limitation with our current compiler.

Let’s create an word to add the . ascii value on stack:

define('DOT', compile('CHAR .')) // Won't work!

It won’t work as we wanted: It will try to get the current input buffer when it is called, for example DOT + will put + on the stack and then print its value as number. We need some way to make CHAR to be executed on compilation time and make its value available on run time. For this we have these new words:

// ( C: x -- ) ( -- x )
// Pos value at compile time and makes current definition push it 
// on its run time
define('LITERAL', () => { compiler.lit(pop()) }).immediate = true

// ( -- ) Exits compilation mode
define('[', () => compiling = false).immediate = true

// ( -- ) Enters compilation mode
define(']', () => compiling = true)

With this, we can temporarily go to execution mode execute some words and then go back to compile mode and append its result to it:

define('DOT', compile('[ CHAR . ] LITERAL') // This works

Now it works like we wanted to, when evaluate DOT it will push the ascii value for ‘.’ on stack just like we wanted. We also now have a mechanism to switch between compilation and execute mode, so we can use it to implement the word that lets you define other words, the colon-definition:

let curDef = ''

// ( "<spaces>name"-- ) parse name and append to new definition
define('DEFER', () => curDef = inputBuffer.parseName())

define(':', compile('DEFER ]'))

define('BUILD', () => {
  push(codeSpace.push(compiler.build()) - 1)
  compiler = new Compiler()
})
define('PERSIST-DEF', () => {
  const xt = pop()
  dictionary.define(curDef, codeSpace[xt]).code = xt
})

define('COMPILE,', () => compiler.code(pop()))

define('POSTPONE', () => {
  const name = inputBuffer.parseName()
  const d = dictionary.find(name)
  if (!d) {
    throw new Error(`word ${name} not found ${dictionary.dump().map(a => a.name)} `)
  }
  if (d.immediate) {
    compiler.code(d.code)
  } else {
    compiler.lit(d.code).word('COMPILE,')
  }
}).immediate = true
define(';', compile('BUILD PERSIST-DEF POSTPONE [')
).immediate = true

With this implemented, let’s implement IMMEDIATE, that allows us to define [CHAR] that makes the DOT above simpler:

// Marks last definition as immediate
define('IMMEDIATE', () =>
  dictionary.find(curDef).immediate = true)

// ( C: "<spaces>name" -- ) ( -- x ) x is the first char from name
evaluate(': [CHAR] CHAR POSTPONE LITERAL ; IMMEDIATE')

The DOT can be defined as simple as:

: DOT  [CHAR] . ;

Another type of useful defining word is CODE. This allows us defining javascript words withing the interpreter:

define('CODE', () => {
  curDef = inputBuffer.parseName()
  const start = inputBuffer.offset
  let delimCandidate
  do {
    const end = inputBuffer.offset
    delimCandidate = inputBuffer.parseName()
    if (delimCandidate === 'END-CODE') {
      return define(curDef, eval(inputBuffer
        .buffer.slice(start, end - 1)))
    }
  } while (delimCandidate)
  throw new Error('SOURCE finished before END-CODE was found')
})

This is a little hacky, but does its job and allow us to develop everything in forth here on. It will involve some copy and pasting, but this should be worth it in the end.

Cool, right? With that out of way, let start thinking on variables and constants. We can make constants by simply defining a word, like : BL 32 ;, which just pushes the value of space (0x20) into stack. But here shines the meta programming characteristics of Forth: We can define a word that defines this for us:

: CONSTANT  : POSTPONE LITERAL POSTPONE ; ;
0 CONSTANT FALSE
FALSE INVERT CONSTANT TRUE
32 CONSTANT BL

See, we defined CONSTANT, which pops from stack and assign the value to the given name. I used it to defined the constants FALSE, TRUE and BL, and then even used the value of FALSE to define TRUE! But of course we could use some variables too, but now we need someplace to store them. So first we create the following to manage dataSpace storage:

// 1KiB of RAM!
const dataSpace = new DataView(new ArrayBuffer(1024))
const herePtr = 8
dataSpace.setInt32(herePtr, 12)
push(herePtr)

( a-addr -- x ) \ gets the cell value stores at a-addr
CODE @  
  () => push(dataSpace.getInt32(pop())) 
END-CODE

( x a-addr -- ) \ set the value of a-addr to x
CODE !
  () => dataSpace.setInt32(pop(), pop())
END-CODE

\ Creating a variable manually. The pos of HERE is on Stack
CONSTANT HERE

: ALLOT  ( n -- ) \ allocates (or de-allocates if n < 0) n bytes
  HERE @ + HERE ! ;

: CELL+  ( a-addr1 -- a-addr2 ) \ a-addr2 is a-addr1 plus one cell
  4 + ;

: CELLS  ( n1 -- n2 ) \ n2 os the address in bytes for n1 cells
  DUP + DUP + ;

( c-addr -- x ) \ Like @ but only gets a UTF16 Char code
CODE C@
  () => dataStack.push(dataSpace.getUint16(pop()))
END-CODE

( x c-addr -- ) \ Like ! but only sets a UTF16 Char code
CODE C!
  () => dataSpace.setUint16(pop(), pop()) 
END-CODE

: CHAR+ ( c-addr1 -- c-addr2 ) \ c-addr2 is c-addr1 plus 1 UTF16 char
  2 + ;

: CHARS ( n1 -- n2 ) \ n2 os the address in bytes for n1 UTF16 chars
  DUP + ;

CODE B@
  () => dataStack.push(dataSpace.getUint8(pop()))
END-CODE

In summary, HERE returns the pointer to the next available byte. @ dereferences a pointer, ! sets the value pointed by a pointer. ALLOTS allocates and de-allocates space.

I organized the code into multiple lines, but the interpreter does not handle this well, so I put the sources in their respective .4th files and use the following code to wrap then into javascript:

const fs = require('fs')

const funcNames = []
for (const source of process.argv.slice(2)) {
  const lastSlash = source.lastIndexOf('/')
  const lastDot = source.lastIndexOf('.')
  const name = source.slice(lastSlash > 0 ? lastSlash : 0,
    lastDot > lastSlash ? lastDot : source.length)

  const funcName = 'load'
    + name[0].toUpperCase()
    + name.slice(1).replace(/-(\w)/g, (val) => val[1].toUpperCase())
  funcNames.push(funcName)

  console.log(`function ${funcName}(evaluate) {`)
  const code = fs.readFileSync(source, 'utf8')
    .trim()
    .replace(/([\\'"`])/g, '\\$1')
    .replace(/\n/g, '\\n\'+\n    \'')
  console.log(`  return evaluate(\n    \'${code}\'\n  )`)
  console.log('}')
}
console.log('if (typeof module !== \'undefined\') {')
for (const funcName of funcNames) {
  console.log(`  module.exports.${funcName} = ${funcName}`)
}
console.log('}')

With this I can

$ node wrapSource.js *.4th > core.js

And the I just need to make sure core.js is properly imported and the load* functions are called in order.

Back on track, we now can define the words to manipulate variables. Now on variables:

: CREATE ( "<spaces>name" -- ) \ Create pointer to contiguous data
  HERE @ CONSTANT ;

: VARIABLE ( "<spaces>name" -- ) \ Create pointer and allocate 1 cell
  CREATE 1 CELLS ALLOT ;

: , ( x -- ) \ Allocates an extra cell and stores x there
  HERE @ ! 1 CELLS ALLOT ;

: C, ( x -- ) \ Allocates an extra cell and stores x there
  HERE @ C! 1 CHARS ALLOT ;

These words, specially ALLOT and the comma words, also allow accessing multiple addresses like an array.

The previously created CODE is an optional word, from the programming toolset. It allows defining code for the underlining platform, generally assembly, but can be anything the implementation want it to be. The .S I implemented on previous article was also from this toolset. These are useful for tools for programming and debugging, so I decide to fill the remaining words of the basic part:

( -- ) \ Show all WORDS currently defined
CODE WORDS  
  () => output(dictionary.dump()
    .map(d => `${d.code}\t${d.name}\n`).join(''))
END-CODE

: ? ( a-addr -- ) \ Print the value stored at this a-address
  @ . ;

( "<spaces>name" -- ) \ Print the given word in a readable format
CODE SEE 
  () => {
    const name = inputBuffer.parseName()
    const d = dictionary.find(name)
    if (!d) {
      throw new Error(`Can't find word named ${name}`)
    }
    const postfix = d.immediate ? 'IMMEDIATE ' : ''
    const action = d.action
    const body = action.thread
      ? `() => executor([${action.thread.join(', ')}])`
      : action.toString().trim()
    output(`CODE ${d.name}\n  ${body}\nEND-CODE ${postfix}`)
  }
END-CODE

These complete the basic set of of the optional Programming tools, by the way. .S was already here, so no need to go back to it. WORDS just list all the words we have, ? gets the value at given address and SEE lets us see the word definition (this last is not as friendly as I wanted)

But we still lack basic control-flow necessary to get the remaining DUMP. Let’s fix this by first defining those low level control-flow:

: BRANCH ( n -- ) \ Branch by offset n (relative to next)
  R> + >R ;

: RECURSE ( R: -- th ) \ Call the current word being executed
  R> R@ SWAP >R >R 0 >R ;

: EXIT ( R: th -- ) \ Return early from a word
  R> R> DROP DROP ;

( x -- TRUE | FALSE ) \ Returns whether the top is equal to zero
CODE 0=
  () => push(pop() ? 0 : -1) 
END-CODE

: 0BRANCH ( n x -- ) \ Branch by offset n, when x is 0
  0= NEGATE DUP + BRANCH DROP EXIT R> + >R ;

( -- th-addr ) \ Gets address reserved to compile next instruction
CODE PC@
  () => push(compiler.buffer.length) 
END-CODE

: LIT@ ( C: -- orig ) \ Compiles a literal with value to be defined
  0 POSTPONE LITERAL PC@ ;

( C: x orig -- ) \ Defines a pending literal created by LIT@
CODE LIT!
  () => { compiler.buffer[pop() - 1] = pop() } 
END-CODE

RECURSE and EXIT are actually pretty useful as they are, RECURSE calling recursively the current word and EXIT being equivalent of return in other programming languages. 0= is also pretty simple too, it just compare to zero and pushes TRUE or FALSE. BRANCH and 0BRANCH, however, are pretty much low level, but allow us to build the higher level control flow, as such:

: IF ( C: -- orig ) ( x -- ) \ Jumps to next unpaired THEN or ELSE 
\ when the x is 0
  LIT@ POSTPONE SWAP POSTPONE 0BRANCH PC@ ; IMMEDIATE

: THEN ( C: orig -- ) \ Closes the last unpaired IF or ELSE
  NEGATE PC@ + SWAP LIT! ; IMMEDIATE

: ELSE ( C: orig1 -- orig2 ) \ Closes the last unpaired IF and if its
\ x was not 0 jumps to next unpaired THEN
  LIT@ POSTPONE BRANCH >R POSTPONE THEN R> PC@ ; IMMEDIATE

: BEGIN ( C: -- dest ) \ Starts a unbounded loop
  PC@ ; IMMEDIATE

: UNTIL ( C: dest -- ) ( x -- ) \if x is zero, jumps to dest
  PC@ - 4 - POSTPONE LITERAL POSTPONE SWAP POSTPONE 0BRANCH
; IMMEDIATE

: AGAIN ( C: dest -- ) ( x -- ) \ jumps to dest
  PC@ - 3 - POSTPONE LITERAL POSTPONE BRANCH ; IMMEDIATE

: WHILE ( C: dest -- orig dest ) ( x -- ) \ jumps to next unpaired 
\ REPEAT if x is zero
  POSTPONE IF ROT ; IMMEDIATE

: REPEAT ( C: orig dest -- ) \ when last unpaired WHILE's x was not 
\ zero, jump to dest
POSTPONE AGAIN POSTPONE THEN ; IMMEDIATE

: ?DUP ( n -- n n | 0 ) \ Duplicate only if it is not zero
  DUP IF DUP THEN ;

: DUMP ( addr u -- ) \ Print all values within range
  BEGIN ?DUP WHILE 1 - >R DUP B@ . 1 + R> REPEAT DROP ;

There is a few more control flows, but this is good enough for me right now, so I leave as an exercise for the reader :stuck_out_tongue:

Finally, lets deal with strings. I first create a few functions to allow moving string from and to our memory. I also will reserve 80 chars for temporary storing inputs. So finally we can implement SKIP, PARSE and PARSE-NAME. With that we can create several basic words below.

function storeStr(pos, str) {
  for (const ch of str) {
    dataSpace.setUint16(pos, ch.charCodeAt())
    pos += 2
  }
  return str.length
}
function pushStr(pos, str) {
  storeStr(pos, str)
  push(pos)
  push(str.length)
}
function loadStr(pos, len) {
  const arr = new Array(len)
  for (let i = 0; i < len; i++) {
    arr[i] = String.fromCharCode(dataSpace
      .getUint16(pos + i * 2))
  }
  return arr.join('')
}
function popStr() {
  const sz = pop()
  return loadStr(pop(), sz)
}
const scratchPtr = dataSpace.getInt32(herePtr)
dataSpace.setInt32(herePtr, scratchPtr + 160)


( n "nnnn*" -- ) \ Skip zero or more consecutive chars with value n
CODE SKIP
  () => inputBuffer.skip(String.fromCharCode(pop()))
END-CODE

( n "[^n]*n" -- c-addr u ) \ Parse 0 or more characters until the next
\ char n is found or the input is exausted. Return the address and
\ size of parsed string. The delimiter is excluded, but not counted
CODE PARSE () => {
  const str = inputBuffer.parse(String.fromCharCode(pop()))
  pushStr(scratchPtr, str.slice(0, 80))
} END-CODE 

( "<spaces>name" -- c-addr u ) \ Skip consecutive spaces, parse a 
\ valid name and return its address and count
CODE PARSE-NAME () => {
  const str = inputBuffer.parseName()
  pushStr(scratchPtr, str.slice(0, 80))
} END-CODE 

: 2DUP ( n1 n2 -- n1 n2 n1 n2 ) \ Duplicate the 2 elements on top
  OVER OVER ;

: MOVE  ( addr1 addr2 u -- ) \ Copy u bytes from addr1 to addr2
  BEGIN ?DUP WHILE >R OVER @ OVER ! >R 1 + R> 1 + R> 1 - REPEAT
  DROP DROP ;

: S" ( C: ".*<quote>" ) ( -- c-addr u ) \ Save content up to next
\ quote on compilation time and push it addres and chars count on 
\ run time
  [CHAR] " PARSE >R HERE @ SWAP OVER R@ CHARS DUP ALLOT
  MOVE POSTPONE LITERAL R> POSTPONE LITERAL ; IMMEDIATE

: TYPE ( c-addr u -- ) \ Print u chars starting at c-addr
  BEGIN ?DUP WHILE >R DUP C@ EMIT CHAR+ R> 1 - REPEAT DROP ;

: ." ( C: ".*<quote>" -- ) \ Save content up to next quote on compile 
\ time and prints it on runtime
  POSTPONE S" POSTPONE TYPE ; IMMEDIATE

: .( ( ".*<c-paren>" -- ) \ Save content up to next closing bracket
\ and print immediatelly
  [CHAR] ) PARSE TYPE ; IMMEDIATE

Finally, to add the final touch to this, let’s have a few more meta functions and implement an interpreter itself using forth.

( "<spaces>name" -- xt ) \ Pushes the execution token for name
CODE ' 
  () => push(addressOf(inputBuffer.parseName()))
END-CODE

: ['] ( C: "<spaces>name" -- ) ( -- xt ) \ On compilation, parses the 
\ next name and on runtime pushes its execution token
  ' POSTPONE LITERAL ; IMMEDIATE

( c-addr u -- xt n | 0 ) \ Looks for string and return its xt and 1 
\ if immediate or -1 if not. If not found reuturn just 0
CODE FIND-S () => {
  const sz = pop()
  const addr = pop()
  const d = dictionary.find(loadStr(addr, sz))
  if (d) {
    push(d.code)
    push(d.immediate ? 1 : -1)
  } else {
    push(addr)
    push(sz)
    push(0)
  }
} END-CODE

( x*i xt -- x*j ) \ Pops the given execution token and executes it
CODE EXECUTE 
  () => codeSpace[pop()]() 
END-CODE

( -- flag ) \ Pushes -1 if compiling, 0 if executing
CODE IS-COMPILING
  () => push(compiling ? -1 : 0)
END-CODE

( n -- flag ) \ Return -1 if n less than 0, zero if equals or larger
CODE 0<
  () => push(pop() < 0 ? -1 : 0) 
END-CODE

: 0> ( n -- flag ) \ Return -1 if n larger than 0, zero if equals or 
\ less
  NEGATE 0< ;

: 0<> ( n -- flag ) \ Returns -1 if n is not equals to 0
  0= INVERT ;

: =
  - 0= ;

: <
  - 0< ;

: >
  - 0> ;

: <>
  = INVERT ;

( x1 x2 -- x3 ) \ Return the binary 'or' between x1 and x2
CODE OR
  () => push(pop() | pop())
END-CODE

( x1 x2 -- x3 ) \ Return the binary 'and' between x1 and x2
CODE AND
  () => push(pop() & pop())
END-CODE

: WITHIN  ( test low high -- flag ) 
  2DUP < IF
    >R OVER > INVERT SWAP R> < AND
  ELSE
    >R OVER > INVERT SWAP R> < OR
  THEN
  ;
  
( x1 x2 -- x3 ) \ multiply
CODE * 
  () => push(pop() * pop())
END-CODE

: >NUMBER ( prev c-addr1 u1 -- curr c-addr2 n2 )
  ROT >R -1
  BEGIN OVER 0<> AND WHILE
    OVER C@ DUP [CHAR] 0 [CHAR] 9 1 + WITHIN IF
      [CHAR] 0 - R> 10 * + >R 1 - SWAP 1 CHARS + SWAP
      -1
    ELSE DROP 0 THEN
  REPEAT
  R> ROT ROT
;

VARIABLE ABORTED
0 ABORTED !

: ABORT ( -- ) \ 
  1 ABORTED !
;

: ?ABORT ( n -- )
  IF ." ?" ABORT THEN
;

: >NUMBER-WORD ( c-addr u -- n )
  OVER C@ DUP [CHAR] - = IF
    DROP 1 - SWAP
    CHAR+ SWAP
    0 ROT ROT
    >NUMBER
    ?ABORT
    DROP
    NEGATE
  ELSE 
    [CHAR] + = IF 
      1 - SWAP
      CHAR+ SWAP
    THEN
    0 ROT ROT
    >NUMBER
    ?ABORT
    DROP
  THEN
;

VARIABLE STATE
0 STATE !

: INTERPRET  ( x*i -- y*j )
  BEGIN PARSE-NAME ?DUP ABORTED @ 0= AND WHILE 
    FIND-S ?DUP IF
      -1 = STATE @ AND IF 
        COMPILE,
      ELSE
        EXECUTE
      THEN
      IS-COMPILING STATE !
    ELSE
      >NUMBER-WORD
      STATE @ IF POSTPONE LITERAL THEN
    THEN
  REPEAT
  DROP
  ABORTED @ IF ."  Err" 0 ABORTED ! DROP ELSE ."  Ok" THEN CR
;

And voilà, we have minimal weird forth system! As you can see on the sources here, we even use the INTERPRET as it at least correctly flag invalid words as error. You can also see the live demo here and inspect the source.

This was a little tiresome, but I still want to make some adjustments. First, I want to move the stacks an the user defined words to dataSpace, and other stuff that I now see a flawed, but this really is too big already, so Bye for now.