9

Suppose I had a grammar like:

object 
    { members } 
members 
    pair
pair
    string : value 
value 
    number
    string
string 
    " chars " 
chars 
    char
    char chars 
number
    digit
    digit number

I could parse the following example: { "one" : 1234 }

As far as I understand, I should have the tokens object, members, pair, value, string and chars.

Tokenizing the example should produce

object
    ->members
        ->pair
            ->"one"
            ->"1234"

Parsing the tokens should produce

object
    ->pair
        ->"one"
        ->1234

It seems to me like the tokenizer is either useless or I don't fully understand what it should do.

What is the responsibility of a tokenizer? What is the benefit of a tokenizer over parsing the original string?

gnat
Johannes

5 Answers

20

You don't seem to understand what a tokenizer should do. In this example, I'd make the tokenizer recognize five kinds of tokens: {, }, :, string, number. The tokenizer produces a flat sequence of tokens, not a tree. And instead of the grammar being written in terms of individual characters (char, digit), it is now written in terms of tokens.
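For illustration, here is a minimal Python sketch of such a tokenizer (the token names and the TOKEN_SPEC table are invented for this example, not any standard API):

    import re

    # Minimal sketch: one regular expression per kind of token, tried in order.
    TOKEN_SPEC = [
        ("STRING", r'"[^"]*"'),
        ("NUMBER", r'\d+'),
        ("LBRACE", r'\{'),
        ("RBRACE", r'\}'),
        ("COLON",  r':'),
        ("WS",     r'\s+'),   # matched but never emitted
    ]

    def tokenize(text):
        pos, tokens = 0, []
        while pos < len(text):
            for kind, pattern in TOKEN_SPEC:
                m = re.match(pattern, text[pos:])
                if m:
                    if kind != "WS":
                        tokens.append((kind, m.group()))
                    pos += m.end()
                    break
            else:
                raise SyntaxError("unexpected character %r" % text[pos])
        return tokens

    # tokenize('{ "one" : 1234 }') produces a flat sequence, not a tree:
    # [('LBRACE', '{'), ('STRING', '"one"'), ('COLON', ':'),
    #  ('NUMBER', '1234'), ('RBRACE', '}')]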

The benefit is that this simplifies the grammar and parser: You no longer need to describe how to parse strings and numbers (note that real languages' string and numeric literals are far more complicated, which boosts this benefit). As far as the parser is concerned, the grammar becomes

object 
    '{' members '}'
members 
    pair
pair
    string ':' value 
value 
    number
    string

This is not only simpler to write a parser for, but also more useful for comprehending the syntactic structure of programs. I know what a string literal is; the interesting part is how string literals and other atomic units can be combined to form programs.
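To make that concrete, a recursive-descent parser over the token sequence could look like this (a sketch that assumes the tokenize function and token names from above; the function names here are likewise made up):

    def expect(tokens, kind):
        # Consume one token and check it is of the expected kind.
        if not tokens or tokens[0][0] != kind:
            raise SyntaxError("expected %s" % kind)
        return tokens.pop(0)[1]

    def parse_object(tokens):
        # object : '{' members '}'
        expect(tokens, "LBRACE")
        members = parse_pair(tokens)      # members : pair
        expect(tokens, "RBRACE")
        return ("object", members)

    def parse_pair(tokens):
        # pair : string ':' value
        key = expect(tokens, "STRING")
        expect(tokens, "COLON")
        return ("pair", key, parse_value(tokens))

    def parse_value(tokens):
        # value : number | string
        if tokens and tokens[0][0] in ("NUMBER", "STRING"):
            return tokens.pop(0)[1]
        raise SyntaxError("expected a value")

    # parse_object(tokenize('{ "one" : 1234 }'))
    # -> ('object', ('pair', '"one"', '1234'))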

7

You bring up an excellent point. I'm going to disagree with the other answers here, and say that the main goal of a tokenizer is to get better performance during parsing -- i.e., tokenizers are an optimization: an implementation detail of parsing, but not a fundamental one. A large part of the time spent parsing is breaking the input string up into pieces. By optimizing this, the parser's performance can be greatly increased.

So that's a pretty vague definition I just gave, and that's why it's hard to precisely define what a tokenizer should do.

Many languages are defined using two separate grammars: one for tokens, and one for hierarchical syntax elements. You could argue that the purpose of a tokenizer is to implement the token grammar, but this misses the point: splitting a grammar into token and hierarchical grammars is arbitrary and unnecessary from the point of view of expressiveness (although, again, useful as a performance optimization).

It's perfectly reasonable and practical to implement parsers without separate tokenizers, although it's likely the performance will be worse.
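As a sketch of what that looks like (hypothetical Python; the point is that characters are matched directly and no token stream exists anywhere):

    def skip_ws(text, pos):
        while pos < len(text) and text[pos].isspace():
            pos += 1
        return pos

    def parse_number(text, pos):
        # In a scannerless parser, 'number' is just another grammar rule
        # over raw characters; there is no separate lexing pass.
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if pos == start:
            raise SyntaxError("expected a digit at position %d" % pos)
        return int(text[start:pos]), skip_ws(text, pos)

    # parse_number('1234 }', 0) -> (1234, 5)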

It is important to note that there are drawbacks to using a separate tokenizer; one is that the token grammar can become restricted. In my personal experience, avoiding separate tokenization reduces the overall complexity (lines of code, interfaces between subsystems, and so on) of a parser.

5

The original source file, in whatever programming or markup language you're parsing, is just a long sequence of characters. The "words" that make up the language may be conveniently separated by spaces, or they may not.

For instance, in C the character sequences "foo = bar << 2;" and "foo=bar<<2;" should be considered equivalent. The first step in parsing a document is therefore to analyse the sequence of characters and work out where one token ("word") ends and the next one begins.

In my little C example, the tokens are "foo", "=", "bar", "<<", "2" and ";" in both cases. Note the subtlety that it's "<<" and not "<" followed by "<". Tokenisers need to know about the language's syntax, but not its meaning.
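The usual way to get that behaviour is "maximal munch": always take the longest possible match. A hypothetical Python fragment (the OPERATORS list is invented for the example):

    # Maximal munch: candidate operators are tried longest-first,
    # so "<<" is recognised before "<".
    OPERATORS = ["<<=", "<<", "<=", "<", "=", ";"]

    def next_operator(text, pos):
        for op in OPERATORS:
            if text.startswith(op, pos):
                return op, pos + len(op)
        raise SyntaxError("unknown operator at position %d" % pos)

    # next_operator("foo=bar<<2;", 7) -> ('<<', 9), not ('<', 8)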

Only once you've tokenised the string can you start to think about what the document means.

Simon B
2

I agree that the main advantage of tokenizing is speeding up the process. For example, parsing a number is not so simple: you have to handle decimal points, exponential forms, and so on. It's something you want to do only once per number.

With tokenization, you ensure that every number will be parsed only once.
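For instance (a simplified sketch; a real language's numeric grammar is more involved), one regular expression can recognize integers, decimals and exponent forms in a single pass:

    import re

    # One pattern covers integers, decimals and exponential forms,
    # so each number in the input is scanned exactly once.
    NUMBER = re.compile(r'-?\d+(\.\d+)?([eE][+-]?\d+)?')

    for m in NUMBER.finditer("12 3.5 -4e10 2.5E-3"):
        print(m.group())   # 12  3.5  -4e10  2.5E-3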

Depending on the kind of parser, tokenizing can add complexity (one more parsing step) or remove complexity (splitting the job in two simpler, distinct steps).

Gin Quin
1

Lexing and parsing lend themselves to different formalisms. The goal is to make both tasks easier to program and maintain, as well as faster at runtime.

If you look at (f)lex, the usual lexer-generator, it uses regular expressions to express the lexical rules. This is notationally much more compact than a grammar expressed as BNF or some similar parser specification.
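For example (a Python sketch using the re module to stand in for a flex rule), the lexical rule for an identifier is a one-line regular expression, while the equivalent character-level grammar needs several recursive productions:

    import re

    # One line of regex replaces recursive character-level productions like
    #   chars : char | char chars
    IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

    # IDENTIFIER.match("foo_bar42").group() -> 'foo_bar42'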

At runtime, a lexer can be baked down to a finite automaton; a parser cannot, because it must handle arbitrarily nested structure, which requires a stack. So splitting the job speeds up the process. The parser proper then uses a technique such as LR(k), LL(1) or LALR on the token stream.
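To illustrate the finite-automaton point (a hand-rolled sketch, not actual flex output): the lexical rule digit+ reduces to a two-state automaton driven by a plain transition table.

    # Sketch: the lexical rule digit+ as a two-state finite automaton.
    # A transition table and a current state are all it needs -- no stack.
    DFA = {
        ("start", "digit"): "in_number",
        ("in_number", "digit"): "in_number",
    }

    def accepts_number(s):
        state = "start"
        for ch in s:
            state = DFA.get((state, "digit" if ch.isdigit() else "other"))
            if state is None:
                return False
        return state == "in_number"

    # accepts_number("1234") -> True; accepts_number("12a4") -> False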

The 'dragon book' (Compilers: Principles, Techniques, and Tools, by Aho, Sethi, and Ullman) has been the classic undergraduate-level text on compilation techniques for nearly 30 years.

200_success
bmargulies