
(TLDR) To build a parser for a recursive grammar by composition of individual parsers (e.g. with a parser combinator framework), there are often circular dependencies between some of the individual parsers. While circular dependencies are generally a sign of bad design, is this a valid case where the circular dependency is inevitable? If so, which solution would be preferable to deal with the circular dependency? Or are parser combinators just a bad idea altogether? (/TLDR)


There are other questions asking about dependency injection with circular dependencies. Typically, the answer is to change the design to avoid the circularity.

I have come across a typical case where circular dependencies arise: when different services inspect a recursive structure.

I have been thinking about other examples, but so far the best I have come up with is a parser for a recursive grammar. Let's use JSON as an example, because it is simple and well-known.

  • A JSON "value" can be a string (".."), an object ({..}) or an array ([..]).
  • A string is simple to parse, and has no further children.
  • An object is made up of keys and values, and the surrounding object syntax.
  • An array is made up of values, and the surrounding array syntax.
  • A "key" within an object is basically the same as a string.

Now I am going to create a number of parser objects:

  • A value parser, which depends on the other 3 parsers.
  • An object parser, which depends on string parser and value parser.
  • An array parser, which depends on value parser.
  • A string parser.

I want to manage the 4 parsers with a dependency injection container. Even without a container, we still have to figure out the order in which to create the different parsers. There is a chicken-and-egg problem.
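To make the cycle concrete, here is a minimal sketch of the constructor-injected design in Java (all names hypothetical, bodies elided):

interface Parser {
    Object parse(String input);
}

class StringParser implements Parser {
    public Object parse(String input) { /* parse "..", no children */ return null; }
}

class ArrayParser implements Parser {
    private final Parser valueParser;               // needs the value parser
    ArrayParser(Parser valueParser) { this.valueParser = valueParser; }
    public Object parse(String input) { /* parse [..] via valueParser */ return null; }
}

class ObjectParser implements Parser {
    private final Parser stringParser;
    private final Parser valueParser;               // needs the value parser
    ObjectParser(Parser stringParser, Parser valueParser) {
        this.stringParser = stringParser;
        this.valueParser = valueParser;
    }
    public Object parse(String input) { /* parse {..} */ return null; }
}

class ValueParser implements Parser {
    private final Parser stringParser, objectParser, arrayParser;
    ValueParser(Parser s, Parser o, Parser a) {     // needs all three parsers
        this.stringParser = s; this.objectParser = o; this.arrayParser = a;
    }
    public Object parse(String input) { /* dispatch on first character */ return null; }
}

With constructor injection alone, neither ValueParser nor ObjectParser/ArrayParser can be created first.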


There are known solutions to this, which can be observed in existing parser libraries. So far I have mostly seen the "stub" solution.

  1. Avoid the circular dependency..

.. by passing the value parser as an argument to the object and array parsers' parse() method.

This works, but it taints the signature of the parse() method. Imagine we want this to be something like a parser combinator, which can be reused for other grammars. We would want a parser interface that is generic and independent of the specific grammar, so we can't have it require a specific parser to be passed around.
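For illustration, here is what this does to the interface (hypothetical sketch, continuing the Java example above): the grammar-specific dependency leaks into the generic signature.

interface Parser {
    // the value parser now travels as an argument; every parser,
    // even one that never recurses, must accept it
    Object parse(String input, Parser valueParser);
}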

  2. Use a stub.

Instead of requiring each dependency in the constructor, we could use a set() or add() method on one of the parsers. E.g. we first create an empty value parser ("stub"), and then add the object, array and string parsers to it via the add() method.
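A sketch of the stub approach (hypothetical names; this ValueParser replaces the constructor-injected one from the earlier sketch): the value parser is constructed empty, so it can be handed to its children before it is complete.

import java.util.ArrayList;
import java.util.List;

class ValueParser implements Parser {
    private final List<Parser> alternatives = new ArrayList<>();
    public void add(Parser alternative) { alternatives.add(alternative); }
    public Object parse(String input) {
        // try each registered alternative in turn (simplified)
        for (Parser p : alternatives) { /* ... */ }
        return null;
    }
}

class Wiring {
    static Parser buildValueParser() {
        ValueParser value = new ValueParser();      // empty stub, no dependencies yet
        Parser string = new StringParser();
        value.add(string);
        value.add(new ObjectParser(string, value)); // value is still a stub here
        value.add(new ArrayParser(value));
        return value;                               // now fully initialised
    }
}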

  3. Use a proxy.

Instead of creating the actual value parser, we create a proxy object with a reference to the container. The real value parser is created only when parse() is called on the proxy for the first time.
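A sketch of the proxy variant (the Container interface here is hypothetical): the proxy satisfies the generic Parser interface immediately and resolves the real parser on first use.

interface Container {
    Parser get(String name);                        // hypothetical DI container
}

class ValueParserProxy implements Parser {
    private final Container container;
    private Parser real;                            // created on first parse()
    ValueParserProxy(Container container) { this.container = container; }
    public Object parse(String input) {
        if (real == null) {
            real = container.get("valueParser");    // resolved lazily
        }
        return real.parse(input);
    }
}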


Now this is all fine, and I suppose it is just a matter of taste which solution to prefer.

But how does this square with the typical high-horse response that circular dependencies are a sign of bad design? The example seems perfectly valid, and there is an entire class of problems like it.

4 Answers

  1. Don't make 4 parsers; make 1 parser. You're not parsing 4 different languages; you're parsing 1 language with 4 major grammatical components. The "circular dependency problem" can be handled quite easily with a bog-standard recursive-descent parser whose ParseObject and ParseValue methods can both call each other.
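To illustrate, a minimal recursive-descent sketch in Java (not production JSON: no escapes, numbers, or whitespace handling): the mutual recursion lives in method calls, so there is nothing to wire up.

import java.util.*;

class JsonParser {
    private final String s;
    private int i = 0;
    JsonParser(String s) { this.s = s; }

    Object parseValue() {
        switch (s.charAt(i)) {
            case '"': return parseString();
            case '{': return parseObject();
            case '[': return parseArray();
            default: throw new IllegalArgumentException("unexpected '" + s.charAt(i) + "'");
        }
    }

    String parseString() {
        int start = ++i;                        // skip opening quote
        while (s.charAt(i) != '"') i++;
        return s.substring(start, i++);         // skip closing quote
    }

    List<Object> parseArray() {
        List<Object> items = new ArrayList<>();
        i++;                                    // skip '['
        while (s.charAt(i) != ']') {
            items.add(parseValue());            // mutual recursion
            if (s.charAt(i) == ',') i++;
        }
        i++;                                    // skip ']'
        return items;
    }

    Map<String, Object> parseObject() {
        Map<String, Object> map = new LinkedHashMap<>();
        i++;                                    // skip '{'
        while (s.charAt(i) != '}') {
            String key = parseString();
            i++;                                // skip ':'
            map.put(key, parseValue());         // mutual recursion
            if (s.charAt(i) == ',') i++;
        }
        i++;                                    // skip '}'
        return map;
    }
}

For example, new JsonParser("{\"a\":[\"b\",\"c\"]}").parseValue() returns a map containing a list, with no parser objects depending on each other.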
Mason Wheeler

First, there is no way of expressing a JSON parser that is not ultimately "circular". We can, however, stave off the circularity. To express this in a simpler way, we use a simple formal language defined as

array = squareArray | angleArray
squareArray = "[" array* "]"
angleArray = "<" array* ">"

and a corresponding type (in pseudo-code)

type Array = SquareArray List[Array] | AngleArray List[Array]

We then define a simple parser of this type (using a fictional monadic parser library).

arrayParser : Parser[Array]
arrayParser = try squareArrayParser `or` angleArrayParser

squareArrayParser : Parser[Array]
squareArrayParser = 
      from start  in parseText "["
      from arrays in many arrayParser
      from end    in parseText "]"
      select SquareArray arrays

angleArrayParser : Parser[Array] 
angleArrayParser = 
      from start  in parseText "<"
      from arrays in many arrayParser
      from end    in parseText ">"
      select AngleArray arrays

There is, to my mind, absolutely nothing wrong with the above definition. If we are dead set on removing circularity, we can note that arrayParser, squareArrayParser and angleArrayParser are all defined mutually recursively. We can factor out this recursive nature by first making each take its dependencies as parameters.

arrayParserF : forall t. Parser[t] -> Parser[t] -> Parser[t]
arrayParserF pSquareArray pAngleArray = try pSquareArray `or` pAngleArray

squareArrayParserF : forall s, t. (List[s] -> t) -> Parser[s] -> Parser[t]
squareArrayParserF pSquareArray pArrayParser = 
      from start  in parseText "["
      from arrays in many pArrayParser
      from end    in parseText "]"
      select pSquareArray arrays

angleArrayParserF : forall s, t. (List[s] -> t) -> Parser[s] -> Parser[t]
angleArrayParserF pAngleArray pArrayParser = 
      from start  in parseText "<"
      from arrays in many pArrayParser
      from end    in parseText ">"
      select pAngleArray arrays

We see that we can immediately get back what we started with by saying

arrayParser' : Parser[Array]
arrayParser' = arrayParserF squareArrayParser' angleArrayParser'
  where
    squareArrayParser' : Parser[Array]
    squareArrayParser' = squareArrayParserF SquareArray arrayParser' 

    angleArrayParser' : Parser[Array]
    angleArrayParser' = angleArrayParserF AngleArray arrayParser'

Or equivalently

arrayParser' : Parser[Array]
arrayParser' = (\ pArrayParser -> 
                  arrayParserF 
                    (squareArrayParserF SquareArray pArrayParser)
                    (angleArrayParserF AngleArray pArrayParser)) arrayParser'

Or equivalently (for fun)

arrayParser' : Parser[Array]
arrayParser' = fix (makeArrayParser SquareArray AngleArray)

makeArrayParser : (List[t] -> t) -> (List[t] -> t) -> Parser[t] -> Parser[t]
makeArrayParser pSquareArray pAngleArray pArrayParser = 
    arrayParserF 
        (squareArrayParserF pSquareArray pArrayParser)
        (angleArrayParserF  pAngleArray  pArrayParser)

While this is mostly academic, it is not entirely useless, as we can now very easily define a parser that parses up to some fixed nesting of brackets.

fixedArrayParser : Int -> Parser[Array]
fixedArrayParser 1 = makeArrayParser SquareArray AngleArray blankParser
fixedArrayParser (n + 1) = makeArrayParser SquareArray AngleArray (fixedArrayParser n)

We can even get more creative with our combinators and interpret our language as expressions of sums and products

sumsAndProductsParser : Parser[Int]
sumsAndProductsParser = fix (makeArrayParser product sum)

So factoring out our parser into subparsers can help us achieve flexibility and reusability -- it does make it much easier to parse restrictions and extensions of the original grammar -- but it does so at a cost of complexity. If all you want to do is parse JSON, then this may well be too much.

walpen

Many grammars have mutually recursive production rules, JSON being a prime example. This is not a mark of bad design.

With regards to using parser combinator frameworks: if the host language allows mutually recursive definitions (e.g. Haskell), then there is no problem. In languages which do not (e.g. Java), one trick is to allow for lazily initialised parsers - I used this in my framework. An example of its usage, for a JSON grammar, is here.

The lazily initialised parser is jvalue:

private static final Parser.Ref<Character, Node> jvalue = Parser.ref();

which is declared at the beginning of the grammar, and initialised at the end:

static {
    jvalue.set(
        choice(
            jnull,
            jbool,
            jnumber,
            jtext,
            jarray,
            jobject
        ).label("JSON value")
    );
}

The Parser.Ref type is itself a Parser (i.e. it implements the Parser interface).
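In essence, such a Ref is a mutable cell that implements the parser interface by delegation. A simplified sketch (not the framework's actual code; the signatures are reduced to a single type parameter):

interface Parser<A> {
    A parse(String input);                      // simplified signature
}

class Ref<A> implements Parser<A> {
    private Parser<A> impl;                     // set once, before first use
    public void set(Parser<A> impl) { this.impl = impl; }
    public A parse(String input) {
        if (impl == null) {
            throw new IllegalStateException("Ref not initialised");
        }
        return impl.parse(input);               // delegate to the real parser
    }
}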

One of the prime advantages of the parser combinator approach is that it provides a DSL for defining grammars, which is hosted in the programming language you're already using. This itself has benefits, such as making the semantic actions associated with your grammar production rules subject to the type-checking provided by the host language - i.e. your rules have to be well-typed. Another benefit is being able to extend the DSL with your own combinators.
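As a sketch of that last point (using the simplified Parser<A> interface from above, not the framework's real API), a user-defined combinator is just an ordinary, type-checked method returning a parser:

import java.util.function.Function;

class MyCombinators {
    // a user-defined combinator: run a parser and transform its result
    static <A, B> Parser<B> map(Parser<A> p, Function<A, B> f) {
        return input -> f.apply(p.parse(input));
    }
}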


Recursive data structures are always a bit tricky: you will need either some kind of laziness, or to mutate your data structure after initial construction. Of these two, mutation is simpler, and can easily be used to implement laziness. However, external setters lead to a quite fragile design where you are forced to perform easily forgotten, manual steps to make sure your parser object has been properly initialized.

For that reason, I would consider a dependency container that is queried lazily to be the far cleaner solution:

class ObjectParser(parsers) {
    parse(input) { ... parsers.getArrayParser().parse(input) ... }
}

class ArrayParser(parsers) {
    parse(input) { ... parsers.getObjectParser().parse(input) ... }
}

class Parsers {
   // Instead of creating a new object each time,
   // you could use the getter to lazily initialize some field.
   getArrayParser()  { return new ArrayParser(this) }
   getObjectParser() { return new ObjectParser(this) }
}

new Parsers().getObjectParser().parse(input)

Note that the parser “objects” ObjectParser and ArrayParser have only a single public method parse() in your design, so they are actually equivalent to functions or closures. We can inline parsers.getArrayParser().parse(input) to parsers.parseArray(input):

class Parsers {
  parseArray(input) { ... this.parseObject(input) ... }
  parseObject(input) { ... this.parseArray(input) ... }
}

new Parsers().parseObject(input);

So magically, the latter code is exactly equivalent and equally extensible, but uses far less code. Why? Because the host language object system already introduces the required indirection to make this work: the implicit this parameter (previously the explicit parsers) is already pointer-like, and method lookup must be done fairly lazily if you might have subclasses (“late binding”). But we don't even need that for a recursive descent parser, if our language allows us to pre-declare functions, or does not even need predeclarations. In C, this would work as well:

// predeclarations
ResultT parseArray(InputT input);
ResultT parseObject(InputT input);

// implementation
ResultT parseArray(InputT input) { ... parseObject(input) ... }
ResultT parseObject(InputT input) { ... parseArray(input) ... }

Where does the indirection come from now? From the compiler: due to the predeclarations, a call to one function can be compiled before the call target is known. This is fairly straightforward if both functions are in the same compilation unit, otherwise a linker is required to “tie the knot”.

If you are just writing a recursive descent parser (which works for JSON but doesn't work well for more complicated languages that aren't LL(1)), then using the simplest thing that could work is the solution you should choose: no mutability, no proxy objects, no virtual dispatch, just mutually recursive procedures.

If you want to use a parser generator that simulates a recursive descent parser for you, you will need some values or objects to pass around. Should your language happen to support functional programming or at least higher-order functions, nothing changes, because you can just pass handles to those parser functions around (e.g. function pointers in C, method references in Java 8, …). If your language is more purist OOP (e.g. Java < 8, C++ < 11), you will need the explicit objects ObjectParser, ArrayParser etc. as in the first example. Only then would I consider this massive waste of code to be a viable solution.
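For example, a sketch of the higher-order style in Java 8 (names hypothetical): a combinator takes parser handles as plain function values, so no dedicated parser classes are needed.

import java.util.function.Function;

class Combinators {
    // 'choice' works on any parser handles; no ObjectParser/ArrayParser
    // wrapper objects required (null result stands in for "no match")
    static <R> Function<String, R> choice(Function<String, R> first,
                                          Function<String, R> second) {
        return input -> {
            R result = first.apply(input);
            return result != null ? result : second.apply(input);
        };
    }
}

// usage: pass method references as parser handles, e.g.
// Function<String, Object> value =
//     Combinators.choice(MyParsers::parseObject, MyParsers::parseArray);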

amon