Saturday, November 12, 2011

Scanning

This is the second post in my series on compiler construction techniques.


I'm a big fan of learning by doing. To that end, each post will be accompanied by running source code which you can find at the omakase project on github.


We'll need a language to compile. All real languages have features, quirks, idiosyncrasies, call them what you will, which make writing compilers for them challenging. Rather than tackling all the complexities immediately, I'm going to start with a gentle introduction. We'll start with a simple, curly brace style, imperative, typeless language with classes. Overtime we'll extend the language to add types, advanced control flow constructs as well as many of the common quirks that are present in 'real' languages which make them challenging to compile.


To start with 'Hello world' will look something like this:


// hello.oma
class Program {
  main() {
    print("Hello World!");
  }
}


Compilers are large, complex programs which can easily devolve into a coding horror. The key to building a successful compiler is a modular design. The first phase in the compiler is called scanning(also known as lexing or tokenization). Scanning breaks the input text into tokens and discards the whitespace (spaces, tabs and newlines) and comments. Tokens are the smallest units in the program above individual characters and represent punctuation(like '{', '(' and ';'), keywords ('class'), identifiers ('main', 'Program', and 'print') and literals ("Hello World!"). The program for this blog entry just scans the entire input file into tokens, and prints each token one per line. The output from scanning the above example is:



$ omascan.sh hello.oma

hello.oma (2, 1): CLASS
hello.oma (2, 7): IDENTIFIER Program
hello.oma (2, 15): OPEN_CURLY
hello.oma (3, 3): IDENTIFIER main
hello.oma (3, 7): OPEN_PAREN
hello.oma (3, 8): CLOSE_PAREN
hello.oma (3, 10): OPEN_CURLY
hello.oma (4, 5): IDENTIFIER print
hello.oma (4, 10): OPEN_PAREN
hello.oma (4, 11): STRING "Hello World!"
hello.oma (4, 25): CLOSE_PAREN
hello.oma (4, 26): SEMI_COLON
hello.oma (5, 3): CLOSE_CURLY
hello.oma (6, 1): CLOSE_CURLY
hello.oma (7, 1): END_OF_FILE

Some things to note about each token: 
  • Each token knows its location in the original source including the file it is contained in, as well as its start (and ending ) lone and column position in the source file. 
  • Each token has a kind (CLASS, IDENTIFIER, OPEN_CURLY, ...).
  • For many tokens location and kind are all they contain, while some token kinds, IDENTIFIER and STRING, have additional information. The name of the identifier, or the value of the string constant respectively.
Tokens are represented as instance of the token class:

public class Token {
  public final TokenKind kind;
  public final SourceRange location;
}


Identifiers and strings are represented by classes which extend Token:


public class IdentifierToken extends Token {
  public final String value;
}
public class StringLiteralToken extends Token {
  public final String value;
}

The comment as well as whitespace characters do not appear in the token list. The END_OF_FILE token is added to indicate the end of the input file. While this token doesn't correspond to any characters in the file, appending it to the end of the token stream will come in handy in the next stage of the compiler.

One of the most important features of Tokens are that they are immutable - once created, tokens do not change. Compilers create a new representation of the input program, they do not modify the input source files; making the tokens and source locations immutable suits our problem space.  Where sensible, immutable data structures will save you a ton of headaches. 



OK, that's all the time I have for tonight. In the next installment, we'l take a tour of the scanner implementation

Thursday, November 10, 2011

Leave it to the Chef ...


Over the last couple of weeks an old friend of mine has been asking my advice on compiler construction. We talk for an hour or so, then he'll go away for a couple of days only to be back with more questions. I’ve worked on a few compilers over the years and apparently I've got a lot to say about the subject. I've seen some tasty morsels and some compilers I wouldn't serve my dog. Heck I've written plenty of both. I'm going to use this space to present some of the lessons I've learned in compiler writing - both best practices and pitfalls to avoid.

Whenever you are in the mood to talk compilers come on over and sit at the counter for a little compiler construction served up fresh ...