Lexing in JS style 😎

This post is a continuation of the AIM Project series, so if you haven't already, I recommend reading the previous posts for answers to any How? and Why? questions.

In this article, it's time to start coding the AIM language for real! I'll start by creating a lexer. A lexer, or if you don't like cool names - tokenizer, is a tool that converts human-readable text into a list of tokens for later processing. It's used in creating programming languages but also for text processing and various other tasks. So, note that this doesn't apply only to creating programming languages. Now, take a look at the example here:

"128 + 428"

Basic, super-simple addition of two numbers. Now let's see how this can be changed to the form of tokens:

["128", "+", "428"]

Tokens don't have to be only strings. They can be, for example, objects containing additional metadata for later use. This tutorial will showcase how to implement a basic lexer that converts one of the forms above into the other.
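For illustration, here's a minimal sketch of what object-form tokens might look like (the SimpleToken shape and its fields are just assumptions for this example, not AIM's actual token format):

```typescript
// A hypothetical object-form token: the raw value plus some metadata.
// SimpleToken and its fields are illustrative, not AIM's real shape.
interface SimpleToken {
  value: string;  // the matched text
  id: string;     // token type identifier
  column: number; // 1-based position within the line
}

const sample: SimpleToken[] = [
  { value: "128", id: "number", column: 1 },
  { value: "+", id: "operator", column: 5 },
  { value: "428", id: "number", column: 7 }
];

console.log(sample.map(t => t.id).join(" ")); // number operator number
```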


Tooling

Naturally, there are a lot of libraries and other, even bigger creations for this sort of stuff. The most popular ones include moo and lex. There are even whole toolkits that aid you in creating lexers and parsers, e.g. nearley and jison. Besides, these lists can be much, much longer for other languages more specialized in this field (like C/C++), but this time it's JavaScript or rather TypeScript only. 😀 By utilizing these you can get the job done pretty easily and fast. But the purpose of this tutorial and the AIM project as a whole isn't to just use different libraries. No, this is going to be self-implemented from the ground up. Now - let's get started!


Definition

Let's start by defining what our lexer should look like.
It should:

  • Implement all of AIM's grammar in portable and extendable form;
  • Progressively scan given text token by token;
  • Have nice ways of iterating processed tokens;
  • Provide methods for basic editing of tokens and their lists.

This is pretty basic stuff - everything you should expect from a properly built lexer. Next, we need to decide exactly how to create our lexer. There are 3 standard solutions for this kind of software:

  • By using multiple regexps;
  • By using a single regexp;
  • By reading text character by character.

Here we're going with the second option. Firstly, it's pretty easy to process text by utilizing regexps. They allow us to easily extend our grammar whenever and however we want. Also, reading text char by char isn't an optimal solution when the grammar is meant to be changed or developed. Finally, compared to the first option, a single regexp should provide a bit better performance.
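Before diving into the real implementation, here's a self-contained sketch of the single-regexp idea: every matcher becomes one capture group, the groups are joined with "|", and the index of the group that matched identifies the grammar rule. The matcher definitions below are simplified assumptions, not AIM's grammar:

```typescript
// Simplified, illustrative matchers - not the real AIM grammar.
const matchers = [
  { id: "number", match: "\\d+" },
  { id: "plus", match: "\\+" },
  { id: "whitespace", match: "\\s" }
];

// Join all matchers into one regexp, one capture group per rule.
const combined = new RegExp(matchers.map(m => `(${m.match})`).join("|"), "gm");

const found: string[] = [];
let m: RegExpExecArray | null;
while ((m = combined.exec("128 + 428")) !== null) {
  // Groups 1..n map back to matchers[n - 1]; find the one that matched.
  const groupIndex = m.slice(1).findIndex(g => g !== undefined) + 1;
  found.push(matchers[groupIndex - 1].id);
}

console.log(found.join(",")); // number,whitespace,plus,whitespace,number
```

Thanks to the `g` flag, each call to `exec()` resumes from `lastIndex`, which is exactly what makes progressive, token-by-token scanning possible.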


Let's code!

I've decided to divide code into 3 basic files:

  • grammar.ts - file where the grammar is defined for later use,
  • lexer.ts - a place for basic Lexer class,
  • token.ts - a place for Token class.

lexer.ts

I'll start by defining the Lexer class and its methods:

import Token from "./token";

export interface GrammarStruct {
  id: string;
  match: string;
}

export default class Lexer {
  private index: number = 0;
  private expr: string = "";
  private regex?: RegExp;
  public tokens: Token[] = [];
  public column: number = 1;
  public line: number = 1;
  public data: string = "";
  public grammar: GrammarStruct[] = [
    {
      id: "newline",
      match: "\\n"
    },
    {
      id: "whitespace",
      match: "\\s"
    }
  ];


  private getRegex() {}
  public loadDefinition(def: GrammarStruct) {}
  public loadGrammar(grammar: GrammarStruct[]) {}
  public loadData(data: string) {}
  public next() {}
  public processAll() {}
  public update() {}
  public empty() {}
}

Let's investigate this boilerplate further; the bodies of the listed methods will be covered later on.

The beginning is marked with an import of the Token class and the definition of the GrammarStruct interface, specifying what a single-token-matching regexp container should look like. Next comes the Lexer class with a few properties whose names speak for themselves. 3 of them are marked as private, i.e. index, expr and regex, as these are handled by the lexer and shouldn't be used outside of it. Now, let's move on to the methods.

// ...
private getRegex() {
    // Regenerate the combined regexp only when new definitions invalidated it.
    if (!this.regex) {
      this.regex = new RegExp(this.expr, "gmu");
    }
    // Continue matching from where the previous token ended.
    this.regex.lastIndex = this.index;
    return this.regex;
  }
// ...

The first internal method, getRegex(), is used to generate a single regexp from the passed expr (which is joined from the GrammarStruct matchers) and to ensure that lastIndex is properly set when the regexp needs to be regenerated (i.e. when a new GrammarStruct is added).

// ...
public loadDefinition(def: GrammarStruct) {
    if (this.expr.length > 0) this.expr += "|";
    this.expr += `(${def.match})`;
    this.regex = undefined;
    this.grammar.push(def);

    return this;
}

public loadGrammar(grammar: GrammarStruct[]) {
    for (const def of grammar) {
      this.loadDefinition(def);
    }

    return this;
}
// ...

The loadDefinition() and loadGrammar() functions are responsible for loading GrammarStructs, i.e. combining them into a single matching expression. loadDefinition() loads a single GrammarStruct (matcher definition), while loadGrammar() loads an array of them (a whole grammar). this is returned for easier chainability (this applies to other methods too).
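To visualize what loadDefinition() actually builds up, here's a standalone sketch of just the expression-joining part (stripped of the class around it):

```typescript
// Standalone sketch of how loadDefinition() grows the combined expression:
// each matcher becomes a capture group, and groups are joined with "|".
let expr = "";
const loadDefinition = (match: string): void => {
  if (expr.length > 0) expr += "|";
  expr += `(${match})`;
};

loadDefinition("\\n");
loadDefinition("\\s");
loadDefinition("\\d+");

console.log(expr); // (\n)|(\s)|(\d+)
```

Because each definition appends exactly one capture group, group number n in a match always corresponds to the n-th loaded GrammarStruct.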

// ...
public loadData(data: string) {
    this.data += data;

    return this;
}
// ...

loadData() does what its name implies - loads more data for the lexer. Data is just a string, appended to the longer in-lexer one.

// ...
public next() {
    const regex = this.getRegex();
    const match = regex.exec(this.data);
    if (match) {
      const length = match[0].length;
      // Find the capture group that produced the match to map it back to
      // its grammar definition (groups are 1-indexed, grammar is 0-indexed).
      const token = this.grammar[match.indexOf(match[0], 1) - 1];
      const id = token.id;
      this.index += length;
      this.tokens.push(
        new Token(
          {
            column: this.column,
            line: this.line,
            value: match[0],
            length,
            id
          },
          this
        )
      );
      if (id === "newline") {
        this.column = 1;
        this.line++;
      } else if (id === "whitespace") {
        this.column++;
      } else {
        this.column += length;
      }

      return this.tokens[this.tokens.length - 1];
    }
}
// ...

next() is a bit trickier than any of the methods before. But there's nothing magical about this one either. It just matches the next token in the data using the regexp, processes it, and adds a new Token based on the generated data - i.e. its location, length, and ID - to the list. It additionally checks for any newlines and whitespaces (matchers for them are predefined by default in Lexer) and handles them properly to calculate the location (line and column number) of each token.
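The most cryptic line above is probably the grammar lookup. Here's an isolated illustration of that trick, with a simplified three-rule regexp standing in for the real combined one:

```typescript
// In the combined regexp every grammar rule is exactly one capture group,
// so searching the match array (from index 1) for a group equal to the
// full match tells us which rule fired; subtracting 1 maps the group
// number back to the grammar array. Simplified rules stand in for AIM's.
const grammar = ["number", "plus", "whitespace"];
const regex = /(\d+)|(\+)|(\s)/gm;

const match = regex.exec("12 + 3")!;
// match is ["12", "12", undefined, undefined] - group 1 matched.
const grammarIndex = match.indexOf(match[0], 1) - 1;

console.log(grammar[grammarIndex]); // number
```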

// ...
public processAll() {
    // Keep requesting tokens until next() finds no further match.
    while (true) {
      const token = this.next();
      if (!token) break;
    }

    return this.tokens;
}
// ...

processAll() is just a derivative of the next() method. Basically, it matches all tokens possible in the supplied data until no more can be found, and returns the whole list of them at once.

// ...
public update() {
    this.tokens = this.tokens
      .filter(token => {
        return token.value && token.value !== "";
      })
      .sort((a, b) => {
        const line = a.line - b.line;
        const column = a.column - b.column;
        return line === 0 ? column : line;
      })
      .map((token, index, tokens) => {
        if (index > 0) {
          const previous = tokens[index - 1];
          if (previous.id === "newline") {
            return token.moveTo(previous.line + 1, 1, false);
          }
          return token.moveTo(
            previous.line,
            previous.column + previous.length,
            false
          );
        } else {
          return token.moveTo(1, 1, false);
        }
      });

    return this;
  }
// ...

update() is another big player in the game. It sorts and arranges the tokens array in a clean, functional way. First, the array is filtered of empty tokens - those that have no value. Next, they're sorted by their respective locations. Lastly, tokens are mapped to arrange them starting from line and column number 1, which involves checks for newlines and whitespaces. This method has its usage later on in most of the Token class methods.
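The sorting step can be illustrated in isolation - the comparator orders tokens by line first and falls back to column on ties:

```typescript
// Isolated illustration of update()'s location comparator: earlier lines
// come first, with ties broken by column.
const locations = [
  { line: 2, column: 1 },
  { line: 1, column: 5 },
  { line: 1, column: 2 }
];

locations.sort((a, b) => {
  const line = a.line - b.line;
  const column = a.column - b.column;
  return line === 0 ? column : line;
});

console.log(locations.map(l => `${l.line}:${l.column}`).join(" ")); // 1:2 1:5 2:1
```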

// ...
public empty() {
    this.data = "";
    this.line = 1;
    this.column = 1;
    this.index = 0;
    this.tokens = [];

    return this;
}
// ...

The empty() method closes the list. It does the dirty work of emptying the Lexer's data for reuse (grammar definitions remain loaded).

And that's it for the Lexer class! It isn't that complicated - if complicated at all! But that's how everything should be - why make a big problem out of something so easy to solve? Of course, some improvements can probably be made, but the basic idea remains the same.


token.ts

In this file, the even simpler Token class is declared. It basically looks like this:

import Lexer from "./lexer";

interface TokenData {
  value: string;
  id: string;
  line: number;
  column: number;
  length: number;
}
export default class Token implements TokenData {
  public value: string;
  public id: string;
  public line: number;
  public column: number;
  public length: number;
  private lexer: Lexer;
  
  public constructor(params: TokenData, ctx: Lexer) {
    this.lexer = ctx;
    this.set(params, false);
  }
  public setValue(newValue: string, update = true) {}
  public moveTo(line?: number, column?: number, update = true) {}
  public moveBy(line?: number, column?: number, update = true) {}
  public set(params: Partial<TokenData>, update = true) {}
  public remove() {}
}

At the very beginning, we have an import of the Lexer class for type-definition purposes and the declaration of the TokenData interface, which defines all values needed to create a new token. The Token class is nothing more than a simple collector for basic data with some helper functions. The Lexer is required to be passed as a so-called context for later interaction between its methods and the Token API.

// ...
public setValue(newValue: string, update = true) {
    this.value = newValue;
    this.length = newValue.length;
    if (update) {
      this.lexer.update();
    }
    return this;
}
// ...

setValue() does exactly what it's meant to do - sets the value of the token, and also its length. This is one of many token-editing methods, which can optionally be used for basic editing of generated tokens. Its second parameter, with a default value of true, indicates whether the Lexer should call the update() method after all other tasks.

// ...
public moveTo(line?: number, column?: number, update = true) {
    line && (this.line = line);
    column && (this.column = column);
    if (update) {
      this.lexer.update();
    }
    return this;
}

public moveBy(line?: number, column?: number, update = true) {
    line && (this.line += line);
    column && (this.column += column);
    if (update) {
      this.lexer.update();
    }
    return this;
}
// ...

moveTo() and moveBy() are utility methods used to reposition already matched tokens. moveTo() moves the token to the specified line and column, and moveBy() moves it by a given number of lines and columns. After the movement is indicated, the token is moved in the array by the Lexer's update() method.

// ...
public set(params: Partial<TokenData>, update = true) {
    this.value = params.value || this.value;
    this.id = params.id || this.id;
    this.line = params.line || this.line;
    this.column = params.column || this.column;
    this.length = params.length || this.length;
    if (update) {
      this.lexer.update();
    }
    return this;
}
// ...

set() is used to set different values of the token with a single call.

// ...
public remove() {
    this.value = undefined;
    this.id = undefined;
    this.line = undefined;
    this.column = undefined;
    this.length = undefined;
    this.lexer.update();
 }
 // ...

remove() removes all the token's values and runs the update() method, in which the token is filtered out of the list because of its lack of value.

So, the Token class mainly features some methods for editing its data. They may not always be needed, but it's good functionality to have.


grammar.ts

import { GrammarStruct } from "./lexer";

const grammar: GrammarStruct[] = [
  // Comments
  {
    id: "single_line_comment_begin",
    match: ">>>"
  },
  {
    id: "multi_line_comment_begin",
    match: ">>"
  },
  {
    id: "multi_line_comment_end",
    match: "<<"
  }
  // ...
]

export default grammar;

In the grammar.ts file, we define our grammar in the form of an array of objects. We provide an id as an identifier for the type of the matched token, and match as a regexp in the form of a string for later concatenation. One thing to note here: because our complete regexp is generated in a linear way, the right order of GrammarStruct matchers must be kept.
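Here's a quick demonstration of why that order matters, using the comment matchers from above:

```typescript
// Alternation tries branches left to right, so the longer ">>>" opener
// must come before ">>" - otherwise ">>>" gets consumed as a ">>" token
// plus a stray ">".
const wrongOrder = /(>>)|(>>>)/gm;
const rightOrder = /(>>>)|(>>)/gm;

const wrong = wrongOrder.exec(">>> comment")![0];
const right = rightOrder.exec(">>> comment")![0];

console.log(wrong); // >>
console.log(right); // >>>
```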


In one piece

After all the code above is put together (you can find the full source code at the core package of the AIM multi-repo) it's time to use this creation! It all comes down to as little as the code below:

import Lexer from "../src/lexer";
import grammar from "../src/grammar";

const AIMLexer = new Lexer().loadGrammar(grammar);
AIMLexer.loadData("public variable #int32 = 1");
AIMLexer.processAll();

Now, I could end this story here, but there's one more catch. You see, the lexer is only used to process linear text into an array of tokens. It's the job of yet another tool - the parser - to read/process them in the right way. The one aspect that's especially well-related to this problem is the implementation of strings in our grammar. That's mainly because of the idea of creating something like JS template literals in AIM. How can you match, with a single regexp, all of the possibilities, i.e. escaped values, characters, and anchors?

"text\$${value}text"

The simple answer is: you don't. Maybe the solution is obvious to some of you, but it really required some deep thinking from me (most likely I wasn't open-minded enough). You have to work with strings char by char (at least that's what I came up with). Take a look at part of my grammar definitions array, for example.

[
    // ...
    {
        id: "char",
        match: `(?<=(?:(?:\\b|^)["'\`])|[\\x00-\\x7F])[\\x00-\\x7F](?=(?:[\\x00-\\x7F]+)?["'\`](?:\\b|$))`
    },
    // ...
    // Anchors and brackets
    {
        id: "string_anchor",
        match: "['`\"]"
    }
    // ...
]

What I've done is divide the string into anchors and chars. This way, when I match it against any given string, I'll be welcomed with many different tokens with an id of char and... that's totally fine! I can later process them with the parser into the final, good-looking AST form.
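As a simplified, hedged illustration of the resulting token stream - using plain per-character classification instead of the lookbehind-based regexp above - the string literal "abc" decomposes like this:

```typescript
// Simplified sketch of the anchor-plus-chars idea: quotes become
// string_anchor tokens and every character between them becomes its own
// char token, to be reassembled into a proper string node by the parser.
const input = `"abc"`;
const anchor = /['`"]/;

const ids = [...input].map(ch => (anchor.test(ch) ? "string_anchor" : "char"));

console.log(ids.join(",")); // string_anchor,char,char,char,string_anchor
```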


It's only the beginning

The lexer is just a piece of cake compared to the parser and especially the compiler. But it's really important to have all the puzzle pieces in the right place. Only when the base is solid will the tower not fall. With that said, I think some changes might occur to the lexer's code (mainly when writing the parser), but the main idea will remain the same.

Again, if you want to take a look at the full code, go to the AIM repo. If you want to follow the process of AIM development more closely, consider starring the repo or following me on Twitter. 😎