john patrick amata

Parsers, briefly.

02 Jun 2024

Tags: [ language-design  ]

Five days ago there was a post on parsers at a forum I was in that went:

â—‰ How to implement a parser?

Now I dont know much about parsers… so, let’s talk about parsers 💀

After all it gives me a good excuse to learn it too

So after some moments of brief studying…

Basically, a parser takes a stream of data and turns it into a structured format, often an Abstract Syntax Tree (AST). This stream of data is usually a series of tokens—tiny chunks of information tagged with what they represent in the language

Let’s say we’re working with a tiny script written in a simple language, it’s JS-like:

let x = 5;
if (x > 3) {
    x = x + 1;
}

First, we lex (tokenize) this script

Lexing is basically just breaking the script into tokens like so:

let: KEYWORD
x: IDENTIFIER
=: OPERATOR
5: NUMBER
;: DELIMITER
if: KEYWORD
(: DELIMITER
x: IDENTIFIER
>: OPERATOR
3: NUMBER
): DELIMITER
{: DELIMITER
x: IDENTIFIER
=: OPERATOR
x: IDENTIFIER
+: OPERATOR
1: NUMBER
;: DELIMITER
}: DELIMITER

Next, the parser comes into play. It takes these tokens and builds a data structure, usually an AST. Parsers do this by looking at tokens, figuring out what they represent, and recursively building the structure

Let’s see how this works with our if statement

if (x > 3) {
    x = x + 1;
}

Step 1 - Recognizes if

Step 2 - Parsing the condition

Step 3 - Parsing the expression x > 3

Step 4 - Parsing the body

Step 5 - Parsing the block x = x + 1;

After all of these, you got an AST that would look similar to this

IfStatement {
    condition: BinaryExpression {
        left: Identifier { name: x },
        operator: >,
        right: Number { value: 3 }
    },
    body: Block {
        statements: [
            Assignment {
                identifier: Identifier { name: x },
                value: BinaryExpression {
                    left: Identifier { name: x },
                    operator: +,
                    right: Number { value: 1 }
                }
            }
        ]
    }
}

And that’s pretty much it I guess

The AST can then be used for whatever you need—interpreting the code, compiling it, or anything else

Parsers can get pretty complex, especially with different parsing techniques out there (like LL(1) or LR parsers), but at their core, they all break down a stream of tokens into a structured format. This blog post, the parser described is recursive descent, implemented LL(1) (see addendum). It’s all about recognizing patterns and building a corresponding structure

Addendum

I have a python implementation of the parser described here on my github at:

github.com/xjpa/recursive-descent-parser

It parses the sample script above including the let x = 5; and prints a rudimentary visualization of the AST tree

« Why I Write
The Code Monkey's Last Dance »

If you like this post, you can say thanks by following me on đť•Ź @0xj0hn so I can look legit enough to slide my DMs into @taylorswift13

Tweet