Recently I’ve been working on libpush, which a new parsing library for C. It has two main features that I think will be valuable: it’s a push parser, which means that instead of parsing a file, stream, or single memory buffer, you supply the data (or “push” it) to the parser in chunks, as it becomes available. I plan to discuss this aspect of the parser in more detail in a later post.
The other main feature is that you design your parsers using combinators. Parser combinators are widely used in Haskell, with Parsec being the most common example. Combinator-based parsing libraries are especially nice in Haskell, because Haskell’s syntax makes them look very simple. For instance, a parser that parses matching nested parentheses is:
parens :: Parser ()
parens = (char '(' >> parens >> char ')' >> parens) <|> return ()
Here, the <|>
operator represents choice: we try parsing the left
operand, and if it fails, then we try the right operand. In our
example, the right operand is the base case, which matches the empty
string. The left operand parses an opening parenthesis; then
recursively calls itself to match any parentheses that might be nested
in the current set; then parses the closing parenthesis; and then
finally tries to match a nested set that occurs after the current set.
When we say that this is a combinator-based parser, we mean that it’s
implemented by taking primitive parsers — in this case char '('
and return ()
— and combining them into more complex parsers using
generic operators like >>
and <|>
.
Now, in order to be able to use combinators like this, parsers have to
be first-class objects in your language. In the Haskell code, the
parsers are represented by the Parser ()
type. In most Haskell
parsing libraries (including Parsec), the parser type is implemented
as a
monad.
Monads have a reputation for being a horribly complex topic, but in
this case, we don’t really need to learn about the underlying math.
Instead, we can just view the monad as letting us do two things
concisely:
Parsers can return a value, which could (for instance) be the
abstract syntax tree that you’re building up while parsing your
language. The monadic bind operator (>>=
) gives you a way to
“pass” these values between parsers, if needed.
Simultaneously, the parser monad maintains the state of the stream you’re parsing from, keeping track of how many bytes remain, whether there’s an error condition, and possibly a nice human-readable description (line and column) of the current location.
This is admittedly a lot of setup; we’ve been talking a lot about Haskell in a post that’s ostensibly describing a C library. But hopefully, this gives you a taste for the kinds of features we want to support in libpush:
Parsers will be represented by a C type. In libpush, this is the
push_callback_t
type.
There will be several primitive parsers; these will be functions
that return a push_callback_t
. The functions can take in
parameters, but none of the parameters will be a push_callback_t
.
(See the char
primitive from above; it needed to take in the
particular character that is expected.)
There will be several combinators; these will be functions that
return a push_callback_t
, and take in other push_callback_t
s as
parameters.
You can see several of these primitives and combinators in action in the libpush Github repository.
We will use something like a monad to take care of passing values between our parsers, and for keeping track of the state of the underlying stream. I say “something like a monad”, because, unlike the Parsec library, the libpush parser type will not be implemented as a monad; in turns out that C is more amenable to implementing them as arrows. In a later post, I’ll explain what this means in terms of writing your own parsers, or for building them up from combinators.