Pest: a simple walkthrough of the greatest parser

2022-03-07
#howto

I never thought my software career would involve writing so many parsers, but… here I am.

How I wish I had used parsing expression grammars (PEGs) before! If you’re like me, you’ve written parsers using regexes or custom reducer functions (or some composition thereof). You owe it to yourself to try a PEG.

PEGs are similar to context-free grammars, with one key difference: they aren’t ambiguous. If a string parses, there is exactly one valid parse, because alternatives are tried in order and the first match always wins. This will make more sense below.

Pest is a PEG parser generator implemented in Rust. If that’s too hipster for you, I totally understand; there are PEG libraries for most popular languages.

Baby steps

Go to the Pest website and navigate to the sandbox editor (at the bottom of the page as of March 2022).

Try entering a simple grammar:

word = { LETTER+ }

And in the input:

hello

You should see an output that looks something like this:

- word: "hello"

Pest created a “rule” called word, which is defined as one or more letters (upper or lower case). The +, * and ? operators have the same meanings as they do in regular expressions.

Now try adding a phrase rule:

word = { LETTER+ }
phrase = { word ~ (" " ~ word)* }

And for the input…

hello world

The sequence operator ~ can be read as “and then”: match the next thing only if the thing before it matched.

The result is a nested output:

- phrase
  - word: "hello"
  - word: "world"

(You may need to select phrase in the drop-down menu as opposed to word.)
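As an aside, these quantifiers compose with ~ just as you’d expect. Here’s a made-up rule (not part of the example grammar) for signed integers:

```
// an optional minus sign, then one or more ASCII digits
integer = { "-"? ~ ASCII_DIGIT+ }
```

ASCII_DIGIT is another built-in rule, like LETTER.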

Hooking into the API

Now that we’ve built a simple grammar, how do we use the output?

Pest provides an API to access tokens (the start and end positions of a rule) as well as pairs (matching pairs of tokens, e.g. the start and end of a single word). Using pairs is more convenient because you get the entire string matching each rule instead of just a starting or ending position.

First, we’ll add the required pest dependencies to our Cargo.toml:

...
[dependencies]
pest = "2.0"
pest_derive = "2.0"

We place the above grammar into src/hello.pest, and then place the following code into src/main.rs:

use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "hello.pest"]
struct HelloParser;

fn main() {
    let pairs = HelloParser::parse(Rule::phrase, "Hello world")
        .unwrap()
        .flatten();
    for pair in pairs {
        println!("{:?}", pair);
    }
}

The derive macro implements Pest’s Parser trait for HelloParser, based on the grammar in hello.pest. The resulting struct exposes a parse method.

Since the grammar contains nested rules, we flatten the resulting iterator to print out all pairs, including the overall phrase pair.

Here’s the output:

Pair { rule: phrase, span: Span { str: "Hello world", start: 0, end: 11 }, inner: [Pair { rule: word, span: Span { str: "Hello", start: 0, end: 5 }, inner: [] }, Pair { rule: word, span: Span { str: "world", start: 6, end: 11 }, inner: [] }] }
Pair { rule: word, span: Span { str: "Hello", start: 0, end: 5 }, inner: [] }
Pair { rule: word, span: Span { str: "world", start: 6, end: 11 }, inner: [] }
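Rather than Debug-printing each pair, you’d typically match on its rule and pull out the matched text with as_str(). A sketch (same crate setup as above, replacing the loop in main):

```rust
for pair in pairs {
    // as_str() returns the slice of the input that this pair matched
    if pair.as_rule() == Rule::word {
        println!("word: {}", pair.as_str());
    }
}
```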

Since we probably don’t need the phrase rule’s output in this example, let’s make it “silent” by placing an underscore before its opening brace in src/hello.pest:

phrase = _{ ... }

Now the output just shows us matching words:

Pair { rule: word, span: Span { str: "Hello", start: 0, end: 5 }, inner: [] }
Pair { rule: word, span: Span { str: "world", start: 6, end: 11 }, inner: [] }

Ramping it up

As things get more complicated, I find it helpful to remember the following formula (from the Pest docs):

(this ~ next_thing) | (other_thing)
  1. Try this.
  2. If this succeeds, try next_thing.
  3. If either fails, backtrack and try other_thing.

Thinking like this has helped me debug a lot of parsing rules. Rules can be very simple but very powerful.
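Ordered choice is also a common source of bugs. A classic pitfall (hypothetical rules, just for illustration): if a shorter literal is listed first, a longer alternative that shares its prefix can never match:

```
// BAD: "<" always matches first, so "</" is never tried
tag_open = { "<" | "</" }
// GOOD: try the longer literal first
tag_open_fixed = { "</" | "<" }
```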

Parsing XML in the wild

Let’s take something more complicated, like malformed XML. Malformed XML is common all over the web (missing end tags, no root tag, etc.), so maybe it’s better to just call it “XML”. Anyway, here’s an example from a Vietnamese RSS news feed:

<a href="https://vnexpress.net/kim-jong-un-cuoi-bach-ma-trong-phim-tai-lieu-4423584.html">
	<img src="https://vcdn1-vnexpress.vnecdn.net/2022/02/02/kimjongun14jpg-1643775018-1643-7362-2026-1643775033.png?w=1200&h=0&q=100&dpr=1&fit=crop&s=_EXcJy1MyHP8I88yj-7Qyg" >
</a>
</br>
Kim Jong-un cưỡi bạch mã trong phim tài liệu phản ánh một năm ông đưa Triều Tiên vượt qua thời kỳ khó khăn chưa từng thấy.

You could use some XML parser already out there, but you’ll quickly find that not all XML parsers are the same:

  • Does it require matching start and end tags?
  • Does it require a root tag enclosing everything?
  • How is the parser data output? Is it a tree structure, an event-based API, or something else?

Sometimes it’s easier to just write one yourself. Enter Pest…

For my quick-and-dirty XML parser, I wanted to pull out any text that was not within a tag. I started with a grammar that looked something like this:

start_tag = { "<" ~ (!">" ~ ANY)+ ~ ">" }
end_tag = { "</" ~ (!">" ~ ANY)+ ~ ">" }
text = { (!"<" ~ ANY)+ }
xml = { start_tag ~ (text | xml)* ~ end_tag }

There are a few new concepts here:

  • ANY matches any single character, as expected
  • The !">" syntax is a negative lookahead for ">": it matches at the current position only if the next character is not >
  • The negative lookahead does not consume anything by itself; you need the ANY to actually consume the next character (otherwise it will just put the parser in an infinite loop)
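If it helps, here’s the same idea hand-rolled in plain Rust (no pest, just to show what the lookahead-then-consume loop in (!">" ~ ANY)+ is doing):

```rust
// Consume characters up to (but not including) the next '>',
// mirroring the PEG fragment (!">" ~ ANY)+ .
fn take_until_gt(input: &str) -> Option<(&str, &str)> {
    // the lookahead: find where '>' would stop us
    let end = input.find('>').unwrap_or(input.len());
    if end == 0 {
        return None; // the `+` requires consuming at least one character
    }
    // matched text, plus the rest of the input (still starting at '>')
    Some((&input[..end], &input[end..]))
}

fn main() {
    assert_eq!(take_until_gt("a href=\"x\">rest"), Some(("a href=\"x\"", ">rest")));
    assert_eq!(take_until_gt(">nope"), None);
}
```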

This is by no means a perfect solution for an XML parser, but it is something I could very quickly wrap my head around and tune to fit my needs.

Easy to tweak

Since I only needed the text outside of tags, the recursive parser above turned out to be overkill.

No problem! I could quickly tweak it to make it simpler:

...
xml = { (start_tag | text | end_tag)+ }

This flat structure is definitely not “correct” XML, but if you were trying to parse “correct” XML, you wouldn’t build your own parser in the first place. This solution was quick to update and test, and it’s easy to read the next time I come back to it.

The point here is that the grammar was very easy to adapt and debug compared to, say, a regex or a state machine, and I didn’t have to touch any Rust code to make the change.

Summary

There are many ways you can parse data. The main ones I’ve come across are:

  • A domain-specific parser optimized for a standard format (e.g. an XML parser like libxml2 in C or lxml in Python)
  • Regular expressions
  • A custom state machine (e.g. a reducer function)
  • A PEG, like Pest

I find myself switching between regexes, domain-specific parsers, and now PEGs. In my experience, PEGs hit a happy medium between expressiveness and readability: I can use them anywhere I’d typically use a regex, but they’re easier to write, interpret, and debug.

If you have strong opinions about PEGs, I’d love to hear what you think! Feel free to reach out to me at @dinosaurcoder.