Saturday, November 12, 2011

Scanning

This is the second post in my series on compiler construction techniques.


I'm a big fan of learning by doing. To that end, each post will be accompanied by running source code which you can find at the omakase project on github.


We'll need a language to compile. All real languages have features, quirks, idiosyncrasies, call them what you will, which make writing compilers for them challenging. Rather than tackling all the complexities immediately, I'm going to start with a gentle introduction. We'll start with a simple, curly brace style, imperative, typeless language with classes. Overtime we'll extend the language to add types, advanced control flow constructs as well as many of the common quirks that are present in 'real' languages which make them challenging to compile.


To start with 'Hello world' will look something like this:


// hello.oma
class Program {
  main() {
    print("Hello World!");
  }
}


Compilers are large, complex programs which can easily devolve into a coding horror. The key to building a successful compiler is a modular design. The first phase in the compiler is called scanning(also known as lexing or tokenization). Scanning breaks the input text into tokens and discards the whitespace (spaces, tabs and newlines) and comments. Tokens are the smallest units in the program above individual characters and represent punctuation(like '{', '(' and ';'), keywords ('class'), identifiers ('main', 'Program', and 'print') and literals ("Hello World!"). The program for this blog entry just scans the entire input file into tokens, and prints each token one per line. The output from scanning the above example is:



$ omascan.sh hello.oma

hello.oma (2, 1): CLASS
hello.oma (2, 7): IDENTIFIER Program
hello.oma (2, 15): OPEN_CURLY
hello.oma (3, 3): IDENTIFIER main
hello.oma (3, 7): OPEN_PAREN
hello.oma (3, 8): CLOSE_PAREN
hello.oma (3, 10): OPEN_CURLY
hello.oma (4, 5): IDENTIFIER print
hello.oma (4, 10): OPEN_PAREN
hello.oma (4, 11): STRING "Hello World!"
hello.oma (4, 25): CLOSE_PAREN
hello.oma (4, 26): SEMI_COLON
hello.oma (5, 3): CLOSE_CURLY
hello.oma (6, 1): CLOSE_CURLY
hello.oma (7, 1): END_OF_FILE

Some things to note about each token: 
  • Each token knows its location in the original source including the file it is contained in, as well as its start (and ending ) lone and column position in the source file. 
  • Each token has a kind (CLASS, IDENTIFIER, OPEN_CURLY, ...).
  • For many tokens location and kind are all they contain, while some token kinds, IDENTIFIER and STRING, have additional information. The name of the identifier, or the value of the string constant respectively.
Tokens are represented as instance of the token class:

public class Token {
  public final TokenKind kind;
  public final SourceRange location;
}


Identifiers and strings are represented by classes which extend Token:


public class IdentifierToken extends Token {
  public final String value;
}
public class StringLiteralToken extends Token {
  public final String value;
}

The comment as well as whitespace characters do not appear in the token list. The END_OF_FILE token is added to indicate the end of the input file. While this token doesn't correspond to any characters in the file, appending it to the end of the token stream will come in handy in the next stage of the compiler.

One of the most important features of Tokens are that they are immutable - once created, tokens do not change. Compilers create a new representation of the input program, they do not modify the input source files; making the tokens and source locations immutable suits our problem space.  Where sensible, immutable data structures will save you a ton of headaches. 



OK, that's all the time I have for tonight. In the next installment, we'l take a tour of the scanner implementation

1 comment:

  1. Cheap Oakley Sunglasses I am able to safely and securely claim of which dance-wise it’s this nearest thing I've got previously go to burning off the intellect. Oakley Sunglasses Outlet To help humor: as i traveled to a tasteful stand-up mixture gathering from the Oakley Sunglasses Sale developing many 2 or 3 weeks before, cheap ray ban sunglasses When i seemed to be migrated to help enact for just a man swiller connected with light vino a little cheap raybans choreography i always visualize for the reason that Disco Pony. scarf Being created some sort of closed fist having the suitable give then rearing the item in excess of the scalp, burberry scarf When i turned the item though located on the eventually left foot or so in addition to easily pivoting 360 college diplomas. burberry scarves Which often brought about a oncoming waitress to help chastise everyone with the “Excuse everyone, mister! ” cheap louis vuitton handbags Although I just started off accomplishing Zumba, louis vuitton handbags this show up in addition to health occurrence of which, burberry scarf sale with 125 places, burberry cashmere scarf has become whipping many 12 mil persons in a lather for the last few years, pink burberry scarf as well as all people by Wyclef Jean towards article author Myra Orlean.

    ReplyDelete