Learn Rust by building a Brainfuck interpreter, part 3: how you should not build a parser

Introduction

This is the third post in a series that builds a Brainfuck interpreter in Rust.

  • In the first post, we created the first building blocks for our Brainfuck interpreter in Rust: a data structure for the tape, and a data type for the instructions. The instructions are based on Rust's enums, which allow to build powerful algebraic data types.
  • In the second post, we built an execution engine that can execute a Brainfuck program which is represented by an abstract syntax tree in the context of a tape. We used the Read and Write traits to make our execution function generic, such that we could not only read from standard input and write to standard output, but also, e.g., read from byte slices and write to a vector of bytes. This makes unit testing easy.

Currently, we can only execute programs which are given as abstract syntax trees in Rust data structures, such as this one:

// read two byte values and write their sum
vec![
    Read, Right, Read,
    Loop(vec![Dec, Left, Inc, Right]),
    Left, Write
];

This corresponds to ,>,[-<+>]<. in Brainfuck source code.

In the current post, we will find ways to transform the latter into the former, i.e., to parse Brainfuck source code and generate an abstract syntax tree.

Please note: I intentionally did this without using libraries because I wanted to build a simple parser from scratch. Since I learned a lot this way, I found it appropriate to write a blog post about the process. However, building a parser with the help of libraries that are designed for this purpose is much easier and less error-prone. In a future post in this series, we will implement a better parser using the Rust crate nom.

Importing what we did so far into a Jupyter notebook

I've taken the code from the Jupyter notebooks that the first posts in this series were based on, and copied it to proper Rust files in the directory for the new blog post.

Just like in the last post, we will import these into the Jupyter notebook that is the source of the current blog post:

In [2]:
:dep rust_bf = { package = "rust-bf", path = "." }
    
use rust_bf::{instructions::{Instruction, Instruction::*}, tape::Tape, executor::execute};

A small extension compared to the last post is that the function that executes Brainfuck code now creates an empty Tape, so we can run programs more easily. We'll show this with a somewhat silly example program that reads a number $n$, then reads $n$ bytes, decreases each of them by one, and writes them. Finally, it writes a newline character (\n):

In [3]:
let transform_input = vec![
    // read number of items (n)
    Read,
    
    Loop(vec![
        // decrease n by one
        Dec, 

        // go right, read input, decrease by one, write output, go left
        Right, Read, Dec, Write, Left
    ]), 

    // set current cell to 10 (ASCII code for newline, \n)
    Inc, Inc, Inc, Inc, Inc, Inc, Inc, Inc, Inc, Inc,

    // write output
    Write
];

We will now execute this program and see what it does with the byte sequence b"\x0cIfmmp!Xpsme\"" as input:

In [4]:
let encrypted_message = b"\x0cIfmmp!Xpsme\"";
let mut m: &[u8] = encrypted_message;
execute(&transform_input, &mut m, &mut std::io::stdout());
Hello World!

So the value of encrypted_message is just the input that transform_input needs to output "Hello world!".

With the generic function execute, we can mix and match input and output objects as long as they implement the traits Read and Write, respectively. In particular, we can collect the output in a Vec<u8>, i.e., a vector of bytes. The following function will be useful, which uses a byte slice as input and writes the output both as a byte slice and as a string if the bytes contain valid UTF-8 data:

In [5]:
fn execute_with_input(program: &[Instruction], input: &[u8]) {
    let mut output = Vec::new();
    let mut input: &[u8] = input;

    execute(program, &mut input, &mut output);

    println!("Output as bytes: {:?}", output);
    println!("Output as str:   {:?}", std::str::from_utf8(&output));
}

We will test it with the program transform_output and the encrypted message from above:

In [6]:
execute_with_input(&transform_input, encrypted_message);
Output as bytes: [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33, 10]
Output as str:   Ok("Hello World!\n")

Step 1: parse programs without loops

Since the hardest part of parsing Brainfuck code is to handle loops properly, we will start simple and first parse loop-less programs. A simple example is a variant of transform_input which does not read the byte count, but uses a hard-coded number of input bytes instead. Then we can unroll the loop. For 12 input bytes, which happens to be the length of the string Hello World!, this is equivalent to the following source code. Note that all characters which are not mapped to Brainfuck instructions are considered comments:1

In [7]:
let transform_12_input_bytes_source = 
b"  ,-.         Repeat 12 times:
    ,-.         * read a number
    ,-.         * decrease by one
    ,-.         * write it
    ,-.
    ,-.
    ,-.
    ,-.
    ,-.
    ,-.
    ,-.
    ,-.

    >           go right to an empty cell
    ++++++++++  store 10 (\n)
    .           write output
";

We will split the parsing process into two functions, the first of which takes a single input character and returns an optional instruction:

In [8]:
fn parse_simple_instruction(c: &u8) -> Option<Instruction> {
    match c {
        b'<' => Some(Left),
        b'>' => Some(Right),
        b'+' => Some(Inc),
        b'-' => Some(Dec),
        b',' => Some(Read),
        b'.' => Some(Write),
        b'[' | b']' => panic!("parse_simple_instruction cannot handle loops"),
        _ => None  // everything else is a comment
    }
}

Option<T> is an enum which has the two variants

  • None, which corresponds to an empty value,
  • Some(T), which holds a value of type T.

This concept will look familiar to readers who have used optional values in other languages, like, e.g., std::optional in C++ or Maybe in Haskell.2

We want to apply this function to every character in the Brainfuck source code. This could be done with a for loop, but it is easier to make use of iterators by applying a sequence of functions to transform the source into the result that we want. I will not describe the basics of iterators in great detail here (see the relevant chapter in The Rust Programming Language or the standard library documentation), but much of what you can do with iterators should be easy to follow for readers who have experience with, e.g., streams in Java, ranges in C++, list transformations in Haskell, or similar constructs in other languages. Here we parse a small Brainfuck source snippet including comments:

In [9]:
&b".-, comment"
    .iter()
    .map(parse_simple_instruction)
    .collect::<Vec<_>>()
Out[9]:
[Some(Write), Some(Dec), Some(Read), None, None, None, None, None, None, None, None]
  • .iter() creates an iterator that allows to iterate through the input bytes,
  • .map(parse_simple_instruction) applies the function above to each element,
  • .collect::<Vec<_>>() collects the transformed elements into a Vec. Note that the element type need not be specified because the compiler can deduce that it is Option<Instruction>.

We see that parsing the instructions worked. It's just a bit inconvenient that the handling of comments results in each instruction being wrapped in Some, and that the None values from comments and whitespace are in the result.

We could fix this by

  • First filtering out the None values. In the code snippet below, this is done with a closure that calls the is_some member of Option. We could also have defined a separate function (as we did with parse_simple_instruction) for this purpose, but using a small anonymous function (also called lambda in other languages) is often easier.
  • And then unwrapping the instructions from Some(...). This works similarly to unwrapping valid results from the Ok variant of Result, which we did in the previous post in this series:
In [10]:
b".-, comment"
    .iter()
    .map(parse_simple_instruction)
    .filter(|opt| opt.is_some())
    .map(|opt| opt.unwrap())
    .collect::<Vec<_>>()
Out[10]:
[Write, Dec, Read]

This works just fine, but the following solution is a bit more elegant:

In [11]:
b".-, comment"
    .iter()
    .filter_map(parse_simple_instruction)
    .collect::<Vec<_>>()
Out[11]:
[Write, Dec, Read]

filter_map is a member of Iterator that does just what we need:

  • it applies the given function,
  • it discards the None values,
  • and it unwraps each value that is wrapped in Some(...).

Now we can build a function that can parse any Brainfuck program without loops. Note that we do not have to specify that .collect() shall collect the resulting iterator into a Vec because the compiler deduces this from the return type of the function:

In [12]:
fn parse_program_without_loops(input: &[u8]) -> Vec<Instruction> {
    input.iter()
        .filter_map(|c| parse_simple_instruction(c))
        .collect()
}

We will now parse the 12 byte transformation program above and execute it to verify that it works:

In [13]:
let transform_12_input_bytes = parse_program_without_loops(transform_12_input_bytes_source);
execute_with_input(&transform_12_input_bytes, b"Ifmmp!Xpsme\"");
Output as bytes: [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33, 10]
Output as str:   Ok("Hello World!\n")

Step 2: parse programs with loops

One interesting aspect of loops is that each instruction that starts a loop, [, must have a matching loop end, ], and vice versa. Therefore, not every combination of characters forms a valid Brainfuck program, and it would be good if our parser could detect invalid programs and report the first error in the source code.

There are two possible kinds of errors:

  • There could be an unmatched loop end, i.e., a ] where no loop was currently active. In this case, we will report the index of the unexpected ] in the sequence of source bytes.
  • There could be an unmatched [. This error could be fixed by adding an ] at the end, so we will just report "missing loop end at end of input". We could also try to find out the index of the unmatched [, but this would complicate the parsing code. We will see in a future post that this is much easier when using the nom crate for parsing.

We will use an enum to represent the different error conditions:

In [14]:
#[derive(Debug)]
enum ParseError {
    MissingLoopEndAtEndOfInput,
    UnexpectedLoopEnd(usize)
}

To build a full Brainfuck parser, we will complement the function parse_simple_instruction with three more functions:

  • parse_next_instruction parses a single instruction, which can either be a simple instruction, or a loop.
  • parse_loop_body parses the body of a loop, i.e., the code between [ and ].
  • parse parses an entire Brainfuck program.

These functions all have a similar signature, which we will discuss in a minute:

In [15]:
fn parse_next_instruction(source: &[u8]) -> Result<(Option<Instruction>, &[u8]), ParseError> {
    todo!()
}

fn parse_loop_body(source: &[u8]) -> Result<(Vec<Instruction>, &[u8]), ParseError> {
    todo!()
}

fn parse(source: &[u8]) -> Result<(Vec<Instruction>), ParseError> {
    todo!()
}
  • They take a byte slice, &[u8], as the single argument. This is the source code which still needs to be parsed.
  • They return a Result, depending on whether parsing was successful or not.
  • The Err variant contains a ParseError which tells what the problem is.
  • The Ok variant contains the instructions which were parsed successfully, i.e.,
    • a Vec<Instruction> for parse and parse_loop_body, and
    • an Option<Instruction> for parse_next_instruction, which parses at most one instruction. It will be None if the next byte in source is a comment.
  • Moreover, since parse_next_instruction and parse_loop_body do not parse the entire program, they also have to return the remaining source code which still has to be parsed. Therefore, these functions return a tuple with two elements in the Ok variant.

Small digression: alternatives to match expressions

Since each of the functions that were outlined above returns a Result, and the return value of parse_next_instruction contains an Option<Instruction> which is wrapped in the Result, we have to handle the different variants of these enum values (Ok and Err for Result, Some and None for Some) when these functions call each other.

In the previous post, we saw how this can be done with pattern matching in match expressions. However, there are simpler alternatives to deal with enums in some cases:

if let

If we want to execute some code only if an enum value matches a specific pattern, we can use the if let construct. For example, a function that takes a slice of integer values and prints the first value if the slice is not empty could be implemented like this:

In [16]:
fn print_first_if_not_empty(values: &[i32]) {
    if let Some(first) = values.iter().next() {
        println!("First value: {}", first);
    }
}
In [17]:
print_first_if_not_empty(&[]);  // no output
In [18]:
print_first_if_not_empty(&[3, 4, 4]);
First value: 3

let-else

Another interesting situation is the following: if a value matches a pattern, then we want to assign a variable based on the value, and use this variable in the remainder of the current code block. Otherwise, we want to leave the current block with something like return, break, continue, or panic!. This can be achieved like this:

In [19]:
fn double_first_value(values: &[i32]) -> Option<i32> {
    let Some(first) = values.iter().next() else {
        eprintln!("No values found.");  // prints to stderr
        return None;
    };

    return Some(2 * first);
}
In [20]:
double_first_value(&[])
No values found.
Out[20]:
None
In [21]:
double_first_value(&[3, 4, 5])
Out[21]:
Some(6)

Propagating errors with the ? operator

Sometimes, a function has to handle a Result value where only the Ok variant can be used for further processing in a meaningful way, and an Err should be returned to the caller directly. Similarly, functions which have to handle an Option might not be able to do anything with a None value and want to just pass it to the caller. To express this in a concise way, Rust has the ? operator.

A question mark after a Result or Option value will try to unwrap the contained Ok or Some value, respectively, and return any Err or None values to the caller directly.

A simple example would be a function that doubles an optional integer:3

In [22]:
fn double(x: Option<i32>) -> Option<i32> {
    Some(2 * x?)
}

If x is a Some value, then the expression x? will be equal to the contained value. Otherwise, None will be returned from the function immediately:

In [23]:
double(Some(4))
Out[23]:
Some(8)
In [24]:
double(None)
Out[24]:
None

Even though it can be used to handle Option values, the ? operator is more interesting for Result values:

In [25]:
fn to_u64(x: Result<f64, &str>) -> Result<u64, &str> {
    let x: f64 = x?;
    if x < 0.0 || x > u64::MAX as f64 {
        Err("value outside u64 range")
    } else {
        Ok(x.round() as u64)
    }
}

Ok values that contain a number will just be unwrapped for further processing:

In [26]:
to_u64(Ok(3.5))
Out[26]:
Ok(4)
In [27]:
to_u64(Ok(-1.0))
Out[27]:
Err("value outside u64 range")

Err input values will be returned immediately:

In [28]:
to_u64(Err("invalid number"))
Out[28]:
Err("invalid number")

Implementing the functions for our parser

Now that we know how to handle Option and Result values without match expressions, let us start with the implementation of parse_next_instruction. It will be called by the other two functions, but only if there is still something to parse. Moreover, the callers will be responsible for handling loop ends. So we can safely assume that we can extract the first character from the source argument, and that it is not ]. We will just panic if these conditions are not fulfilled:

In [29]:
fn parse_next_instruction(source: &[u8]) -> Result<(Option<Instruction>, &[u8]), ParseError> {

    // Extract first character
    let Some(c) = source.iter().next() else {
        panic!("parse_next_instruction() should not be called with an empty source.");
    };

    // Store everything after this character in remaining_source
    let remaining_source = &source[1..];

    match c {
        b']' => panic!("loop end should be handled in the caller"),

        // Loop: wrap the instructions from the loop body in Loop(...) and forward
        // the code after the loop end to the caller for further processing
        b'[' => parse_loop_body(remaining_source)
                .map(|(instructions, source_after_loop)|
                     (Some(Loop(instructions)), source_after_loop)),

        // Everything else is straightforward to parse :-)
        c => Ok((parse_simple_instruction(c), remaining_source))
    }
}

As you can see in the match arm that parses a loop, the method map(...) can be applied to the Result that is returned by parse_loop_body. This means that the mapping is applied only if the Result contains an Ok value. An Err value is left as it is.

We can now implement the function which parses a loop body. Since it might have to parse more than one instruction, we process the source code in a loop, and store the instructions that are found in a mutable Vec:

In [30]:
fn parse_loop_body(source: &[u8]) -> Result<(Vec<Instruction>, &[u8]), ParseError> {
    let mut result: Vec<Instruction> = Vec::new();
    let mut remaining = source;

    while remaining.len() > 0 {
        if remaining[0] == b']' {
            // We have found the loop end. Return the parsed instructions and the code after ']' to the caller.
            return Ok((result, &remaining[1..]));
        }

        // Try to parse the next instruction. Forward any errors to the caller.
        let (opt_instruction, source_after_instruction) = parse_next_instruction(remaining)?;

        if let Some(instruction) = opt_instruction {
            // The character was a valid instruction, and not a comment.
            result.push(instruction);
        }

        remaining = source_after_instruction;
    }

    // The expected loop end was not found :-(
    Err(ParseError::MissingLoopEndAtEndOfInput)
}

Before we proceed with parsing an entire Brainfuck program, it would be good if we could test what we have implemented so far. We will set up a function that takes some Brainfuck source code, and prints the first instruction that is parsed by parse_next_instruction (if any) and the remaining source code, or the error that was encountered:

In [31]:
fn test_parse_next_instruction(source: &[u8]) {
    match parse_next_instruction(source) {
        Ok((opt_instruction, remaining_source)) => {
            println!("instruction: {:?}", opt_instruction);
            println!("remaining:   {:?}", std::str::from_utf8(remaining_source).unwrap());
        }
        Err(error) => {
            println!("error: {:?}", error);
        }
    }
}

First we will test parsing a simple instruction:

In [32]:
test_parse_next_instruction(b"+-<>");
instruction: Some(Inc)
remaining:   "-<>"

If the source starts with a comment, parse_simple_instruction returns None and the remainder of the source which still needs to be parsed:

In [33]:
test_parse_next_instruction(b"comment ...");
instruction: None
remaining:   "omment ..."

An unterminated loop will result in an error:

In [34]:
test_parse_next_instruction(b"[");
error: MissingLoopEndAtEndOfInput

A correct loop is parsed as a single instruction:

In [35]:
test_parse_next_instruction(b"[-<.]>>>>");
instruction: Some(Loop([Dec, Left, Write]))
remaining:   ">>>>"

Nested loops also work just fine:

In [36]:
test_parse_next_instruction(b"[+++[>+<-]]");
instruction: Some(Loop([Inc, Inc, Inc, Loop([Right, Inc, Left, Dec])]))
remaining:   ""

Finally, we will implement the function parse, which parses a full program:

In [37]:
fn parse(source: &[u8]) -> Result<(Vec<Instruction>), ParseError> {
    let mut result: Vec<Instruction> = Vec::new();
    let mut remaining = source;

    while remaining.len() > 0 {
        if remaining[0] == b']' {
            // Loop ends are only expected in parse_loop_body :-(
            return Err(ParseError::UnexpectedLoopEnd(source.len() - remaining.len()));
        }

        // Try to parse the next instruction. Forward any errors to the caller.
        let (opt_instruction, source_after_instruction) = parse_next_instruction(remaining)?;

        if let Some(instruction) = opt_instruction {
            // The character was a valid instruction, and not a comment.
            result.push(instruction);
        }

        remaining = source_after_instruction;
    }

    Ok(result)
}

First, we will test if the correct error is returned for some invalid programs:

In [38]:
parse(b"]")
Out[38]:
Err(UnexpectedLoopEnd(0))
In [39]:
parse(b"[[[]]]]")
Out[39]:
Err(UnexpectedLoopEnd(6))
In [40]:
parse(b"[")
Out[40]:
Err(MissingLoopEndAtEndOfInput)

Now let us try a valid non-trivial program to see if our parser works as expected.

The following program reads a byte $n$, and then outputs the first $n$ Fibonacci numbers:

In [41]:
let fib = parse(b"
,>>+<<
[->
  [->>+<<]
  >
  [-<+>>+<]
  >
  [-<+>]
  <<.<
]").unwrap();
In [42]:
execute_with_input(&fib, &[11]);
Output as bytes: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
Output as str:   Ok("\u{1}\u{1}\u{2}\u{3}\u{5}\u{8}\r\u{15}\"7Y")

So everything seems to work fine 🙂

Maybe you have noticed that the functions parse and parse_loop_body are very similar. They only differences (except for the return type) are:

  • what happens if a loop end character ] is found (return "unexpected loop end" error or return parsed instructions and the remaining source after the loop),
  • what happens if there is no more input to parse (return parsed instructions or return "missing loop end" error).

Even though we will finally implement a new, much simpler parser with the nom crate, factoring out the common parts into a generic function, and parametrizing it with the desired "loop end" and "input end" behavior is a nice opportunity to learn how to pass functions as arguments in Rust functions.

This post is already very long though, so we will postpone this.

Summary

We have implemented a parser, which can transform the source code for any Brainfuck program into an abstract syntax tree. This abstract syntax tree can then be executed with the engine that we have developed in the previous posts in this series.

Along the way, we used a few interesting Rust concepts, such as iterators. We have also worked with the generic enum types Option and Result and found concise ways to work with them.

In the next posts, we will learn how to parse command line arguments and then turn our Brainfuck interpreter into an application that can be run from the command line. Moreover, we will simplify our parser and see that using the right crates makes parsing much easier than writing a parser from scratch.


  1. We use byte slices, rather than strings, to represent Brainfuck source code here. This is why the literals are prefixed with b, which makes them byte string literals. Our parser works by consuming one character at a time, and it is easier to determine the remaining input if the consumed character always has the same size, which is one byte for byte slices. In strings, which are represented as UTF-8 in Rust, a character can correspond to a variable number of bytes.

  2. In some languages, such as Java, all variables (except primitive types like int) can be null. This is not quite the same as an optional value though, which is why Java 8 introduced the generic type Optional<T>.

  3. This function is just used as a simple example for using the ? operator with Option. This particular task could be solved more easily with the map method of Option.

Comments