24

When I began to use parser combinators, my first reaction was a sense of liberation from what felt like an artificial distinction between parsing and lexing. All of a sudden everything was just parsing!

However, I recently came across a posting on codereview.stackexchange illustrating someone reinstating this distinction. At first I thought this was very silly of them, but the fact that functions exist in Parsec to support this behavior leads me to question myself.

What are the advantages/disadvantages of parsing over an already lexed stream in parser combinators?

Eli Frey
  • 341

5 Answers

17

By parsing we most often mean the analysis of context-free languages. A context-free language is more powerful than a regular one, hence the parser can (most often) do the job of the lexical analyser right away.

But this is a) quite unnatural and b) often inefficient.

For a): if I think about how, for example, an if expression looks, I think IF expr THEN expr ELSE expr, and not 'i' 'f', maybe some spaces, then any character an expression can start with, and so on. You get the idea.

For b): there are powerful tools that do an excellent job of recognizing lexical entities such as identifiers, literals, brackets of all kinds, etc. They will do their work in practically no time and give you a nice interface: a list of tokens. No more worrying about skipping spaces in the parser; your parser will be much more abstract when it deals with tokens rather than characters.
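
To illustrate (a minimal sketch; the token type and the tiny expression language are invented for this example), here is what a Parsec parser over a list of tokens could look like. The if rule reads almost literally as IF expr THEN expr ELSE expr:

    import Text.Parsec

    -- Hypothetical token type, as a separate lexer might produce it.
    data Tok = TIf | TThen | TElse | TNum Integer
      deriving (Show, Eq)

    data Expr = Num Integer | If Expr Expr Expr
      deriving Show

    -- A parser that consumes a pre-lexed [Tok] stream instead of characters.
    type TokParser a = Parsec [Tok] () a

    -- Accept one token matching a predicate (source positions elided here).
    satisfyTok :: (Tok -> Maybe a) -> TokParser a
    satisfyTok = tokenPrim show (\pos _ _ -> pos)

    tok :: Tok -> TokParser ()
    tok t = satisfyTok (\x -> if x == t then Just () else Nothing)

    number :: TokParser Expr
    number = satisfyTok (\x -> case x of TNum n -> Just (Num n); _ -> Nothing)

    expr :: TokParser Expr
    expr = ifExpr <|> number

    -- Reads like the grammar rule itself: IF expr THEN expr ELSE expr.
    ifExpr :: TokParser Expr
    ifExpr = do
      tok TIf;   c <- expr
      tok TThen; t <- expr
      tok TElse; e <- expr
      return (If c t e)

    -- ghci> parse expr "" [TIf, TNum 1, TThen, TNum 2, TElse, TNum 3]
    -- Right (If (Num 1) (Num 2) (Num 3))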

After all, if you think a parser should be busy with low-level stuff, why process characters at all? One could just as well write it at the level of bits! You see, such a parser working at the bit level would be almost incomprehensible. It's the same with characters versus tokens.

Just my 2 cents.

Ingo
  • 3,941
10

To everyone suggesting that separating lexing and parsing is "good practice": I have to disagree. In many cases performing lexing and parsing in a single pass gives much more power, and the performance implications are not as bad as they are presented in the other answers (see Packrat).

This approach shines when one has to mix a number of different languages in a single input stream. This is needed not only by weird metaprogramming-oriented languages like Katahdin and the like, but also by much more mainstream applications: literate programming (mixing LaTeX and, say, C++), using HTML in comments, stuffing JavaScript into HTML, and so on.
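
As a small illustration (the template mini-language below is invented for this example), a single scannerless combinator parser can switch between two lexically different sublanguages in the same character stream, which is awkward to express with a single up-front tokenizer:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Two "languages" in one stream: free-form text, and a tiny word
    -- language inside {...} with its own notion of whitespace.
    data Piece = Text String | Hole [String]
      deriving Show

    template :: Parser [Piece]
    template = many (hole <|> plain) <* eof

    plain :: Parser Piece
    plain = Text <$> many1 (noneOf "{")

    hole :: Parser Piece
    hole = Hole <$> between (char '{') (char '}') (word `sepBy1` skipMany1 space)
      where word = many1 (letter <|> digit)

    -- ghci> parseTest template "Hello {first name}!"
    -- [Text "Hello ",Hole ["first","name"],Text "!"]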

SK-logic
  • 8,517
6

A lexical analyser recognizes a regular language and a parser recognizes a context-free language. Since each regular language is also context-free (it can be defined by a so-called right-linear grammar), a parser can also recognize a regular language, and the distinction between parser and lexical analyser seems to add some unneeded complexity: a single context-free grammar (parser) could do the job of both a parser and a lexical analyser.
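
For example, the regular language of identifiers (a letter followed by letters or digits) can be written as a right-linear grammar, i.e. a grammar in which each production contains at most one nonterminal, standing at the right end (here letter and digit stand for terminal characters):

    Ident → letter Rest
    Rest  → letter Rest | digit Rest | ε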

On the other hand, it can be useful to capture some elements of a context-free language through a regular language (and therefore a lexical analyser) because

  1. These elements occur so often that they can be dealt with in a standard way: recognizing number and string literals, keywords, identifiers, skipping white space, and so on.
  2. Defining a regular language of tokens makes the resulting context-free grammar simpler: e.g. one can reason in terms of identifiers instead of individual characters, or one can ignore white space completely if it is not relevant for that particular language.

So separating parsing from lexical analysis has the advantage that you can work with a simpler context-free grammar and encapsulate some basic (often routine) tasks in the lexical analyser (divide et impera).

EDIT

I am not familiar with parser combinators, so I am not sure how the above considerations apply in that context. My impression is that, even if with parser combinators one has only one context-free grammar, distinguishing between two levels (lexical analysis / parsing) could help make this grammar more modular. As said, the lower lexical-analysis layer could contain basic reusable parsers for identifiers, literals, and so on.
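
For instance, Parsec's Text.Parsec.Token module packages such a reusable lexical layer. A rough sketch (the toy grammar below is invented for illustration):

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import Text.Parsec.Language (emptyDef)
    import qualified Text.Parsec.Token as Tok

    -- The reusable lexical layer: identifiers, reserved words, and
    -- whitespace handling come from the token module.
    lexer :: Tok.TokenParser ()
    lexer = Tok.makeTokenParser emptyDef { Tok.reservedNames = ["if", "then", "else"] }

    identifier :: Parser String
    identifier = Tok.identifier lexer

    reserved :: String -> Parser ()
    reserved = Tok.reserved lexer

    -- The grammar level deals only with identifiers and keywords;
    -- no characters or white space in sight.
    ifExpr :: Parser (String, String, String)
    ifExpr = do
      reserved "if";   c <- identifier
      reserved "then"; t <- identifier
      reserved "else"; e <- identifier
      return (c, t, e)

    -- ghci> parse ifExpr "" "if cond then a else b"
    -- Right ("cond","a","b")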

Giorgio
  • 19,764
4

Simply put, lexing and parsing should be separated because they are of different complexities. Lexing corresponds to a DFA (deterministic finite automaton), while parsing corresponds to a PDA (push-down automaton). This means that parsing inherently consumes more resources than lexing, and there are specific optimization techniques available to DFAs only. In addition, writing a finite state machine is much less complex, and it's easier to automate.

You're being wasteful by using a parsing algorithm to lex.
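
As a rough sketch of the asymmetry (the token type is invented for illustration): a lexer is a single left-to-right pass with a fixed amount of state, essentially a hand-coded DFA, whereas the parser built on top of it needs a stack to handle nesting:

    import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

    data Tok = TIdent String | TNum String | TSym Char
      deriving Show

    -- A hand-coded, DFA-style scanner: one linear pass, constant state,
    -- no stack and no backtracking.
    lexAll :: String -> [Tok]
    lexAll [] = []
    lexAll (c:cs)
      | isSpace c = lexAll cs
      | isAlpha c = let (rest, cs') = span isAlphaNum cs
                    in TIdent (c:rest) : lexAll cs'
      | isDigit c = let (rest, cs') = span isDigit cs
                    in TNum (c:rest) : lexAll cs'
      | otherwise = TSym c : lexAll cs

    -- ghci> lexAll "if x1 then 42"
    -- [TIdent "if",TIdent "x1",TIdent "then",TNum "42"]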

DeadMG
  • 36,914
1

One of the main advantages of a separate parse/lex is the intermediate representation: the token stream. This can be processed in various ways that would otherwise not be possible with a combined lex/parse.
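
For example (with an invented token type), whole passes can run over the token stream before the parser ever sees it, say dropping comments or merging adjacent string literals the way C does; with a fused lex/parse there is no stream to hook into:

    data Tok = TIdent String | TStr String | TComment String
      deriving Show

    -- Passes over the intermediate token stream, independent of the parser.
    stripComments :: [Tok] -> [Tok]
    stripComments = filter notComment
      where notComment (TComment _) = False
            notComment _            = True

    -- Merge adjacent string literals, as a C compiler does.
    mergeStrings :: [Tok] -> [Tok]
    mergeStrings (TStr a : TStr b : ts) = mergeStrings (TStr (a ++ b) : ts)
    mergeStrings (t : ts)               = t : mergeStrings ts
    mergeStrings []                     = []

    preprocess :: [Tok] -> [Tok]
    preprocess = mergeStrings . stripComments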

That said, I have found that good ol' recursive descent can be less complicated and easier to work with than learning some parser generator and having to figure out how to express the weaknesses of the grammar within the rules of the parser generator.

sylvanaar
  • 2,305