2020-03-26

Now I have two problems ...

… or why is parsing so @$#^ hard?

So, your design calls for you to take input in a non-trivial text format. What next?
There are lots of ways to proceed: you can hand-tool a finite-state machine or even a recursive descent parser, you can break out some compiler construction toolkit, or you could, just maybe, try to do it with regular expressions.
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Jamie Zawinski
Hand-tooling an FSM works great for little problems, and big problems call for either a compiler-compiler or that hand-tooled recursive descent processor.1

But what should I do for medium-sized problems in C++?

Doing it by hand generates a lot of fiddly code to write, test, debug, and maintain. The basic problem is that the hand-written code includes a lexing step that operates at the wrong level of abstraction: you’re dealing with characters when you should be dealing with tokens.
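For a taste of that fiddliness, here is a minimal sketch of the character-level grubbing a hand-written lexer does (the function and names are mine, purely for illustration, and this covers just two token kinds):

#include <cctype>
#include <string>

// Hand-rolled scan of a single number-or-identifier token: every
// decision is made character by character, and a real lexer needs
// many more branches (operators, strings, comments, error handling).
std::string scan_token(std::string::const_iterator &it,
                       std::string::const_iterator end)
{
    while (it != end && std::isspace(static_cast<unsigned char>(*it)))
        ++it;                                  // skip whitespace

    std::string tok;
    if (it != end && std::isdigit(static_cast<unsigned char>(*it))) {
        while (it != end && std::isdigit(static_cast<unsigned char>(*it)))
            tok += *it++;                      // a number literal
    } else {
        while (it != end && std::isalnum(static_cast<unsigned char>(*it)))
            tok += *it++;                      // an identifier
    }
    return tok;                                // empty string: no match
}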

The purpose of compiler construction toolkits is exactly to let you think at that higher level of abstraction. But they mostly require a separate processing step in your build2 and they all represent another build dependency. And some are hard to get or install on multiple target platforms (I have to support Linux and Windows at work, including both newer and older versions of each).

At this point it is very tempting to use regular expressions to implement your lexer or even the whole lexing-parsing pass. Which is generally a bad idea.3

Remember that I’m assuming here that you have a non-trivial input syntax. What that means on a practical level is that you have decision making to do as you parse, and that means either a lot of little regular expressions wrapped up in code to do the decision making, or a smaller number of very complex regular expressions with lots of alternatives and optional groups. Either way it’s going to be hard to understand, hard to debug, and hard to maintain.
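To make that concrete, here is a sketch of the “lots of little regular expressions” route using std::regex (the token table and names are my invention): each pattern is tried in turn at the current position, and the surrounding loop is exactly the hand-written decision making.

#include <regex>
#include <string>
#include <utility>
#include <vector>

// One little regex per token kind, anchored at the current position;
// the loop below is the decision-making code wrapped around them.
const std::vector<std::pair<std::regex, int>> patterns = {
    {std::regex{R"(^\s+)"},          0},   // whitespace
    {std::regex{R"(^[0-9]+)"},       1},   // number
    {std::regex{R"(^[A-Za-z_]\w*)"}, 2},   // identifier
    {std::regex{R"(^[+\-*/=])"},     3},   // operator
};

int next_token(std::string::const_iterator &it,
               std::string::const_iterator end)
{
    std::smatch m;
    for (const auto &[re, tok] : patterns) {
        if (std::regex_search(it, end, m, re)) {
            it += m.length(0);             // consume the match
            return tok;
        }
    }
    return -1;                             // nothing matched: error
}

Multiply that by every token kind and every context-dependent decision in your grammar, and the maintenance problem is obvious.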

How we got here, maybe?


Back in the day (and I really mean the misty depths of history, even before my time), computers were slow and expensive, and programmer time was relatively cheap. And lexically analyzing a file was notorious for being “slow”. In those days a lot of effort went into writing fast lexers. And the same structure was then used for parsers. Why not?

But these systems are complex, and they require a separate step in the build chain.

Even today you wouldn’t willingly use a slow version of a component that took up a lot of your run time if a fast version was available. So serious compiler construction kits all advertise their speed.

Why I don’t care


In this case I’m talking about processing mini-languages as a small part of a larger workflow, not traditional compiling, where the processing of the input (through several steps) is the work. This kind of parsing is not generally a big part of my run time. I don’t actually care if it’s a bit slow.

What I’d like


I’d like a class that accepts a grammar and an input and produces a stream of tokens. The grammar could be a run-time input or, in the template case, a compile-time input. But if it is a compile-time input it should not require a separate processing step (i.e., it should be implemented via template metaprogramming).
The grammar should include optional function-objects (lambdas, std::function, whatever …) to be run just before emitting the related token (and as an implementation detail, I’d pass a state object for those functors to act on).

What I have in mind is an interface something like this:

#include <cstdint>
#include <string>
#include <vector>

struct rule;   // a pattern, a token, and an optional functor (see below)

template <typename STATE,
          typename STRING = std::string,
          typename TOKEN = std::uint16_t>
class lexer
{
public:
   explicit lexer(const std::vector<rule> &grammar);
   ~lexer() = default;

   // Advance `it` past one token's worth of input, running any functor
   // attached to the matched rule on `state`, and return the token.
   TOKEN scan(typename STRING::const_iterator &it, STATE &state) const;
};
where the user-provided state class collects all the information other than the token stream itself.
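To make the idea concrete, a rule might be little more than a pattern, the token to emit, and that optional functor. This shape is my speculation, not a settled design, and it suggests rule itself wants to be templated on STATE, a wrinkle the forward declaration above glosses over:

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

template <typename STATE, typename TOKEN = std::uint16_t>
struct basic_rule
{
    std::string pattern;   // what to match
    TOKEN token;           // what to emit
    std::function<void(STATE &, const std::string &)>
        action;            // optional: runs just before the token is emitted
};

// Hypothetical usage: a state object that collects values as we lex.
struct calc_state { double last_value = 0; };

const std::vector<basic_rule<calc_state>> grammar = {
    {R"([0-9]+)", 1, [](calc_state &s, const std::string &text) {
         s.last_value = std::stod(text);   // stash the literal's value
     }},
    {R"(\+)", 2, nullptr},                 // no action needed
};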

And I’d like a similar class for parsing.
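By analogy, and with the same hedging (a sketch of a shape, not a design), the parser might take grammar productions and pull tokens from some source, firing user functors as rules are recognized:

#include <cstdint>
#include <vector>

struct production;   // grammar productions, analogous to the lexer's rule

template <typename STATE, typename TOKEN = std::uint16_t>
class parser
{
public:
    explicit parser(const std::vector<production> &grammar);
    ~parser() = default;

    // Pull tokens from `next` (any callable returning the next TOKEN)
    // until the start symbol is recognized or the input rejected,
    // invoking rule-attached functors on `state` along the way.
    template <typename NEXT>
    bool parse(NEXT next, STATE &state) const;
};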

All of this in the standard library, or available as a header-only implementation, so that adding a lexing/parsing step to your project is easy if you can accept a speed penalty.



1 I notice several compiler-compilers producing recursive descent code these days. That’s a change from when I first studied this problem; back then everything seemed to be table-driven.

2 Boost Spirit is an exception here, but it imposes a significant compile-time penalty, and the need to build it out of C++ operators and live with their precedence structure results in a weird DSL.

3 Don’t get me wrong. I’m with Jeff on this: regular expressions are a must-have tool, but you have to keep them in their place.
