Saturday, November 12, 2011

Scanning

This is the second post in my series on compiler construction techniques.


I'm a big fan of learning by doing. To that end, each post will be accompanied by running source code which you can find at the omakase project on github.


We'll need a language to compile. All real languages have features, quirks, idiosyncrasies, call them what you will, which make writing compilers for them challenging. Rather than tackling all the complexities immediately, I'm going to start with a gentle introduction. We'll start with a simple, curly brace style, imperative, typeless language with classes. Overtime we'll extend the language to add types, advanced control flow constructs as well as many of the common quirks that are present in 'real' languages which make them challenging to compile.


To start with 'Hello world' will look something like this:


// hello.oma
class Program {
  main() {
    print("Hello World!");
  }
}


Compilers are large, complex programs which can easily devolve into a coding horror. The key to building a successful compiler is a modular design. The first phase in the compiler is called scanning(also known as lexing or tokenization). Scanning breaks the input text into tokens and discards the whitespace (spaces, tabs and newlines) and comments. Tokens are the smallest units in the program above individual characters and represent punctuation(like '{', '(' and ';'), keywords ('class'), identifiers ('main', 'Program', and 'print') and literals ("Hello World!"). The program for this blog entry just scans the entire input file into tokens, and prints each token one per line. The output from scanning the above example is:



$ omascan.sh hello.oma

hello.oma (2, 1): CLASS
hello.oma (2, 7): IDENTIFIER Program
hello.oma (2, 15): OPEN_CURLY
hello.oma (3, 3): IDENTIFIER main
hello.oma (3, 7): OPEN_PAREN
hello.oma (3, 8): CLOSE_PAREN
hello.oma (3, 10): OPEN_CURLY
hello.oma (4, 5): IDENTIFIER print
hello.oma (4, 10): OPEN_PAREN
hello.oma (4, 11): STRING "Hello World!"
hello.oma (4, 25): CLOSE_PAREN
hello.oma (4, 26): SEMI_COLON
hello.oma (5, 3): CLOSE_CURLY
hello.oma (6, 1): CLOSE_CURLY
hello.oma (7, 1): END_OF_FILE

Some things to note about each token: 
  • Each token knows its location in the original source including the file it is contained in, as well as its start (and ending ) lone and column position in the source file. 
  • Each token has a kind (CLASS, IDENTIFIER, OPEN_CURLY, ...).
  • For many tokens location and kind are all they contain, while some token kinds, IDENTIFIER and STRING, have additional information. The name of the identifier, or the value of the string constant respectively.
Tokens are represented as instance of the token class:

public class Token {
  public final TokenKind kind;
  public final SourceRange location;
}


Identifiers and strings are represented by classes which extend Token:


public class IdentifierToken extends Token {
  public final String value;
}
public class StringLiteralToken extends Token {
  public final String value;
}

The comment as well as whitespace characters do not appear in the token list. The END_OF_FILE token is added to indicate the end of the input file. While this token doesn't correspond to any characters in the file, appending it to the end of the token stream will come in handy in the next stage of the compiler.

One of the most important features of Tokens are that they are immutable - once created, tokens do not change. Compilers create a new representation of the input program, they do not modify the input source files; making the tokens and source locations immutable suits our problem space.  Where sensible, immutable data structures will save you a ton of headaches. 



OK, that's all the time I have for tonight. In the next installment, we'l take a tour of the scanner implementation

6 comments:

  1. Cheap Oakley Sunglasses I am able to safely and securely claim of which dance-wise it’s this nearest thing I've got previously go to burning off the intellect. Oakley Sunglasses Outlet To help humor: as i traveled to a tasteful stand-up mixture gathering from the Oakley Sunglasses Sale developing many 2 or 3 weeks before, cheap ray ban sunglasses When i seemed to be migrated to help enact for just a man swiller connected with light vino a little cheap raybans choreography i always visualize for the reason that Disco Pony. scarf Being created some sort of closed fist having the suitable give then rearing the item in excess of the scalp, burberry scarf When i turned the item though located on the eventually left foot or so in addition to easily pivoting 360 college diplomas. burberry scarves Which often brought about a oncoming waitress to help chastise everyone with the “Excuse everyone, mister! ” cheap louis vuitton handbags Although I just started off accomplishing Zumba, louis vuitton handbags this show up in addition to health occurrence of which, burberry scarf sale with 125 places, burberry cashmere scarf has become whipping many 12 mil persons in a lather for the last few years, pink burberry scarf as well as all people by Wyclef Jean towards article author Myra Orlean.

    ReplyDelete
  2. csino 2021 | Shootercasino
    ‎Chips for CSGO & Esports · ‎Game Modes · ‎Online Tournaments · ‎CS:GO 카지노 Betting & Casino · fun88 vin ‎Bonus 11bet Codes

    ReplyDelete
  3. Casinos Near Casinos and Hotels - MapYRO
    Find Casinos 평택 출장샵 Near 용인 출장샵 Casinos and Hotels in Dallas, 전주 출장샵 Irving, Lawrence, Cusack, and 부천 출장안마 more near Dallas, Texas, United States. 상주 출장안마

    ReplyDelete
  4. The proliferation of rogue operators of distant betting sites is one aspect that makes avid local gamblers cautious of online playing. Yet in South Korea, this issue has been supplied with a solution by suppliers of a service recognized as|often recognized as} eat and run 벳익스플로어 verification. The proposed applied sciences can replace current methodologies, that are sometimes depending on guide reporting, to rapidly and precisely classify roughly 27,000 spam messages may be} sent to KISA each day. In specific, the system proposed on this paper can provide a URL pool to rapidly block illegal playing sites based on compiled spam SMS actions.

    ReplyDelete
  5. Land two scatter symbols on reels 1, three and 5 for a bonus round. You get to re-spin an emblem and, who is aware of}, maybe you'll win sufficient cash to lastly purchase that one-of-a-kind poster signed by The Prince of Darkness himself. The 5x20 slot contains bonus options that come within the form of free spins and image cost ups, and if you activate them, there are some pretty great prizes in retailer. Slot video games do get darkish generally, and Ozzy Osbourne video slot isn’t for the faint-hearted; with its rock music theme featuring vicious ravens, skulls, and black bats, this 1xbet game has one significantly spooky really feel. In fact, the slot has every thing you'll count on from the enduring rock legend it’s based mostly on.

    ReplyDelete
  6. Use these developer assets to simply combine add-ons and third-party services. Learn how manufacturers in your trade are utilizing Optimove to improve every buyer KPI. Optimove’s knowledge scientists create a bespoke predictive buyer mannequin for each client. In 2021, DraftKings Inc. and Golden Nugget Online Gaming Inc. have agreed to an all-stock deal by which DraftKings will purchase Golden Nugget 1xbet Online Gaming.

    ReplyDelete