Learn Rust by building a Brainfuck interpreter, part 3: how you should not build a parser
Introduction¶
This is the third post in a series that builds a Brainfuck interpreter in Rust.
- In the first post, we created the first building blocks for our Brainfuck interpreter in Rust: a data structure for the tape, and a data type for the instructions. The instructions are based on Rust's enums, which allow to build powerful algebraic data types.
- In the second post, we built an execution engine that can execute a Brainfuck program which is represented by an abstract syntax tree in the context of a tape. We used the
Read
andWrite
traits to make our execution function generic, such that we could not only read from standard input and write to standard output, but also, e.g., read from byte slices and write to a vector of bytes. This makes unit testing easy.
Currently, we can only execute programs which are given as abstract syntax trees in Rust data structures, such as this one:
// read two byte values and write their sum
vec![
Read, Right, Read,
Loop(vec![Dec, Left, Inc, Right]),
Left, Write
];
This corresponds to ,>,[-<+>]<.
in Brainfuck source code.
In the current post, we will find ways to transform the latter into the former, i.e., to parse Brainfuck source code and generate an abstract syntax tree.
Please note: I intentionally did this without using libraries because I wanted to build a simple parser from scratch. Since I learned a lot this way, I found it appropriate to write a blog post about the process. However, building a parser with the help of libraries that are designed for this purpose is much easier and less error-prone. In a future post in this series, we will implement a better parser using the Rust crate nom.
Importing what we did so far into a Jupyter notebook¶
I've taken the code from the Jupyter notebooks that the first posts in this series were based on, and copied it to proper Rust files in the directory for the new blog post.
Just like in the last post, we will import these into the Jupyter notebook that is the source of the current blog post:
:dep rust_bf = { package = "rust-bf", path = "." }
use rust_bf::{instructions::{Instruction, Instruction::*}, tape::Tape, executor::execute};
A small extension compared to the last post is that the function that executes Brainfuck code now creates an empty Tape, so we can run programs more easily. We'll show this with a somewhat silly example program that reads a number $n$, then reads $n$ bytes, decreases each of them by one, and writes them. Finally, it writes a newline character (\n
):
let transform_input = vec![
// read number of items (n)
Read,
Loop(vec![
// decrease n by one
Dec,
// go right, read input, decrease by one, write output, go left
Right, Read, Dec, Write, Left
]),
// set current cell to 10 (ASCII code for newline, \n)
Inc, Inc, Inc, Inc, Inc, Inc, Inc, Inc, Inc, Inc,
// write output
Write
];
We will now execute this program and see what it does with the byte sequence b"\x0cIfmmp!Xpsme\""
as input:
let encrypted_message = b"\x0cIfmmp!Xpsme\"";
let mut m: &[u8] = encrypted_message;
execute(&transform_input, &mut m, &mut std::io::stdout());
So the value of encrypted_message
is just the input that transform_input
needs to output "Hello world!".
With the generic function execute
, we can mix and match input and output objects as long as they implement the traits Read
and Write
, respectively. In particular, we can collect the output in a Vec<u8>
, i.e., a vector of bytes. The following function will be useful, which uses a byte slice as input and writes the output both as a byte slice and as a string if the bytes contain valid UTF-8 data:
fn execute_with_input(program: &[Instruction], input: &[u8]) {
let mut output = Vec::new();
let mut input: &[u8] = input;
execute(program, &mut input, &mut output);
println!("Output as bytes: {:?}", output);
println!("Output as str: {:?}", std::str::from_utf8(&output));
}
We will test it with the program transform_output
and the encrypted message from above:
execute_with_input(&transform_input, encrypted_message);
Step 1: parse programs without loops¶
Since the hardest part of parsing Brainfuck code is to handle loops properly, we will start simple and first parse loop-less programs. A simple example is a variant of transform_input
which does not read the byte count, but uses a hard-coded number of input bytes instead. Then we can unroll the loop. For 12 input bytes, which happens to be the length of the string Hello World!
, this is equivalent to the following source code. Note that all characters which are not mapped to Brainfuck instructions are considered comments:1
let transform_12_input_bytes_source =
b" ,-. Repeat 12 times:
,-. * read a number
,-. * decrease by one
,-. * write it
,-.
,-.
,-.
,-.
,-.
,-.
,-.
,-.
> go right to an empty cell
++++++++++ store 10 (\n)
. write output
";
We will split the parsing process into two functions, the first of which takes a single input character and returns an optional instruction:
fn parse_simple_instruction(c: &u8) -> Option<Instruction> {
match c {
b'<' => Some(Left),
b'>' => Some(Right),
b'+' => Some(Inc),
b'-' => Some(Dec),
b',' => Some(Read),
b'.' => Some(Write),
b'[' | b']' => panic!("parse_simple_instruction cannot handle loops"),
_ => None // everything else is a comment
}
}
Option<T>
is an enum which has the two variants
-
None
, which corresponds to an empty value, -
Some(T)
, which holds a value of typeT
.
This concept will look familiar to readers who have used optional values in other languages, like, e.g., std::optional
in C++ or Maybe
in Haskell.2
We want to apply this function to every character in the Brainfuck source code. This could be done with a for
loop, but it is easier to make use of iterators by applying a sequence of functions to transform the source into the result that we want. I will not describe the basics of iterators in great detail here (see the relevant chapter in The Rust Programming Language or the standard library documentation), but much of what you can do with iterators should be easy to follow for readers who have experience with, e.g., streams in Java, ranges in C++, list transformations in Haskell, or similar constructs in other languages. Here we parse a small Brainfuck source snippet including comments:
&b".-, comment"
.iter()
.map(parse_simple_instruction)
.collect::<Vec<_>>()
-
.iter()
creates an iterator that allows to iterate through the input bytes, -
.map(parse_simple_instruction)
applies the function above to each element, -
.collect::<Vec<_>>()
collects the transformed elements into aVec
. Note that the element type need not be specified because the compiler can deduce that it isOption<Instruction>
.
We see that parsing the instructions worked. It's just a bit inconvenient that the handling of comments results in each instruction being wrapped in Some
, and that the None
values from comments and whitespace are in the result.
We could fix this by
- First filtering out the
None
values. In the code snippet below, this is done with a closure that calls theis_some
member ofOption
. We could also have defined a separate function (as we did withparse_simple_instruction
) for this purpose, but using a small anonymous function (also called lambda in other languages) is often easier. - And then unwrapping the instructions from
Some(...)
. This works similarly to unwrapping valid results from theOk
variant ofResult
, which we did in the previous post in this series:
b".-, comment"
.iter()
.map(parse_simple_instruction)
.filter(|opt| opt.is_some())
.map(|opt| opt.unwrap())
.collect::<Vec<_>>()
This works just fine, but the following solution is a bit more elegant:
b".-, comment"
.iter()
.filter_map(parse_simple_instruction)
.collect::<Vec<_>>()
filter_map
is a member of Iterator
that does just what we need:
- it applies the given function,
- it discards the
None
values, - and it unwraps each value that is wrapped in
Some(...)
.
Now we can build a function that can parse any Brainfuck program without loops. Note that we do not have to specify that .collect()
shall collect the resulting iterator into a Vec
because the compiler deduces this from the return type of the function:
fn parse_program_without_loops(input: &[u8]) -> Vec<Instruction> {
input.iter()
.filter_map(|c| parse_simple_instruction(c))
.collect()
}
We will now parse the 12 byte transformation program above and execute it to verify that it works:
let transform_12_input_bytes = parse_program_without_loops(transform_12_input_bytes_source);
execute_with_input(&transform_12_input_bytes, b"Ifmmp!Xpsme\"");
Step 2: parse programs with loops¶
One interesting aspect of loops is that each instruction that starts a loop, [
, must have a matching loop end, ]
, and vice versa. Therefore, not every combination of characters forms a valid Brainfuck program, and it would be good if our parser could detect invalid programs and report the first error in the source code.
There are two possible kinds of errors:
- There could be an unmatched loop end, i.e., a
]
where no loop was currently active. In this case, we will report the index of the unexpected]
in the sequence of source bytes. - There could be an unmatched
[
. This error could be fixed by adding an]
at the end, so we will just report "missing loop end at end of input". We could also try to find out the index of the unmatched[
, but this would complicate the parsing code. We will see in a future post that this is much easier when using the nom crate for parsing.
We will use an enum to represent the different error conditions:
#[derive(Debug)]
enum ParseError {
MissingLoopEndAtEndOfInput,
UnexpectedLoopEnd(usize)
}
To build a full Brainfuck parser, we will complement the function parse_simple_instruction
with three more functions:
-
parse_next_instruction
parses a single instruction, which can either be a simple instruction, or a loop. -
parse_loop_body
parses the body of a loop, i.e., the code between[
and]
. -
parse
parses an entire Brainfuck program.
These functions all have a similar signature, which we will discuss in a minute:
fn parse_next_instruction(source: &[u8]) -> Result<(Option<Instruction>, &[u8]), ParseError> {
todo!()
}
fn parse_loop_body(source: &[u8]) -> Result<(Vec<Instruction>, &[u8]), ParseError> {
todo!()
}
fn parse(source: &[u8]) -> Result<(Vec<Instruction>), ParseError> {
todo!()
}
- They take a byte slice,
&[u8]
, as the single argument. This is the source code which still needs to be parsed. - They return a
Result
, depending on whether parsing was successful or not. - The
Err
variant contains aParseError
which tells what the problem is. - The
Ok
variant contains the instructions which were parsed successfully, i.e.,- a
Vec<Instruction>
forparse
andparse_loop_body
, and - an
Option<Instruction>
forparse_next_instruction
, which parses at most one instruction. It will beNone
if the next byte insource
is a comment.
- a
- Moreover, since
parse_next_instruction
andparse_loop_body
do not parse the entire program, they also have to return the remaining source code which still has to be parsed. Therefore, these functions return a tuple with two elements in theOk
variant.
Small digression: alternatives to match
expressions¶
Since each of the functions that were outlined above returns a Result
, and the return value of parse_next_instruction
contains an Option<Instruction>
which is wrapped in the Result
, we have to handle the different variants of these enum values (Ok
and Err
for Result
, Some
and None
for Some
) when these functions call each other.
In the previous post, we saw how this can be done with pattern matching in match
expressions. However, there are simpler alternatives to deal with enums in some cases:
if let
¶
If we want to execute some code only if an enum value matches a specific pattern, we can use the if let
construct. For example, a function that takes a slice of integer values and prints the first value if the slice is not empty could be implemented like this:
fn print_first_if_not_empty(values: &[i32]) {
if let Some(first) = values.iter().next() {
println!("First value: {}", first);
}
}
print_first_if_not_empty(&[]); // no output
print_first_if_not_empty(&[3, 4, 4]);
let-else
¶
Another interesting situation is the following: if a value matches a pattern, then we want to assign a variable based on the value, and use this variable in the remainder of the current code block. Otherwise, we want to leave the current block with something like return
, break
, continue
, or panic!
. This can be achieved like this:
fn double_first_value(values: &[i32]) -> Option<i32> {
let Some(first) = values.iter().next() else {
eprintln!("No values found."); // prints to stderr
return None;
};
return Some(2 * first);
}
double_first_value(&[])
double_first_value(&[3, 4, 5])
Propagating errors with the ?
operator¶
Sometimes, a function has to handle a Result
value where only the Ok
variant can be used for further processing in a meaningful way, and an Err
should be returned to the caller directly. Similarly, functions which have to handle an Option
might not be able to do anything with a None
value and want to just pass it to the caller. To express this in a concise way, Rust has the ?
operator.
A question mark after a Result
or Option
value will try to unwrap the contained Ok
or Some
value, respectively, and return any Err
or None
values to the caller directly.
A simple example would be a function that doubles an optional integer:3
fn double(x: Option<i32>) -> Option<i32> {
Some(2 * x?)
}
If x
is a Some
value, then the expression x?
will be equal to the contained value. Otherwise, None
will be returned from the function immediately:
double(Some(4))
double(None)
Even though it can be used to handle Option
values, the ?
operator is more interesting for Result
values:
fn to_u64(x: Result<f64, &str>) -> Result<u64, &str> {
let x: f64 = x?;
if x < 0.0 || x > u64::MAX as f64 {
Err("value outside u64 range")
} else {
Ok(x.round() as u64)
}
}
Ok
values that contain a number will just be unwrapped for further processing:
to_u64(Ok(3.5))
to_u64(Ok(-1.0))
Err
input values will be returned immediately:
to_u64(Err("invalid number"))
Implementing the functions for our parser¶
Now that we know how to handle Option
and Result
values without match
expressions, let us start with the implementation of parse_next_instruction
. It will be called by the other two functions, but only if there is still something to parse. Moreover, the callers will be responsible for handling loop ends. So we can safely assume that we can extract the first character from the source
argument, and that it is not ]
. We will just panic if these conditions are not fulfilled:
fn parse_next_instruction(source: &[u8]) -> Result<(Option<Instruction>, &[u8]), ParseError> {
// Extract first character
let Some(c) = source.iter().next() else {
panic!("parse_next_instruction() should not be called with an empty source.");
};
// Store everything after this character in remaining_source
let remaining_source = &source[1..];
match c {
b']' => panic!("loop end should be handled in the caller"),
// Loop: wrap the instructions from the loop body in Loop(...) and forward
// the code after the loop end to the caller for further processing
b'[' => parse_loop_body(remaining_source)
.map(|(instructions, source_after_loop)|
(Some(Loop(instructions)), source_after_loop)),
// Everything else is straightforward to parse :-)
c => Ok((parse_simple_instruction(c), remaining_source))
}
}
As you can see in the match arm that parses a loop, the method map(...)
can be applied to the Result
that is returned by parse_loop_body
. This means that the mapping is applied only if the Result
contains an Ok
value. An Err
value is left as it is.
We can now implement the function which parses a loop body. Since it might have to parse more than one instruction, we process the source code in a loop, and store the instructions that are found in a mutable Vec
:
fn parse_loop_body(source: &[u8]) -> Result<(Vec<Instruction>, &[u8]), ParseError> {
let mut result: Vec<Instruction> = Vec::new();
let mut remaining = source;
while remaining.len() > 0 {
if remaining[0] == b']' {
// We have found the loop end. Return the parsed instructions and the code after ']' to the caller.
return Ok((result, &remaining[1..]));
}
// Try to parse the next instruction. Forward any errors to the caller.
let (opt_instruction, source_after_instruction) = parse_next_instruction(remaining)?;
if let Some(instruction) = opt_instruction {
// The character was a valid instruction, and not a comment.
result.push(instruction);
}
remaining = source_after_instruction;
}
// The expected loop end was not found :-(
Err(ParseError::MissingLoopEndAtEndOfInput)
}
Before we proceed with parsing an entire Brainfuck program, it would be good if we could test what we have implemented so far. We will set up a function that takes some Brainfuck source code, and prints the first instruction that is parsed by parse_next_instruction
(if any) and the remaining source code, or the error that was encountered:
fn test_parse_next_instruction(source: &[u8]) {
match parse_next_instruction(source) {
Ok((opt_instruction, remaining_source)) => {
println!("instruction: {:?}", opt_instruction);
println!("remaining: {:?}", std::str::from_utf8(remaining_source).unwrap());
}
Err(error) => {
println!("error: {:?}", error);
}
}
}
First we will test parsing a simple instruction:
test_parse_next_instruction(b"+-<>");
If the source starts with a comment, parse_simple_instruction
returns None
and the remainder of the source which still needs to be parsed:
test_parse_next_instruction(b"comment ...");
An unterminated loop will result in an error:
test_parse_next_instruction(b"[");
A correct loop is parsed as a single instruction:
test_parse_next_instruction(b"[-<.]>>>>");
Nested loops also work just fine:
test_parse_next_instruction(b"[+++[>+<-]]");
Finally, we will implement the function parse
, which parses a full program:
fn parse(source: &[u8]) -> Result<(Vec<Instruction>), ParseError> {
let mut result: Vec<Instruction> = Vec::new();
let mut remaining = source;
while remaining.len() > 0 {
if remaining[0] == b']' {
// Loop ends are only expected in parse_loop_body :-(
return Err(ParseError::UnexpectedLoopEnd(source.len() - remaining.len()));
}
// Try to parse the next instruction. Forward any errors to the caller.
let (opt_instruction, source_after_instruction) = parse_next_instruction(remaining)?;
if let Some(instruction) = opt_instruction {
// The character was a valid instruction, and not a comment.
result.push(instruction);
}
remaining = source_after_instruction;
}
Ok(result)
}
First, we will test if the correct error is returned for some invalid programs:
parse(b"]")
parse(b"[[[]]]]")
parse(b"[")
Now let us try a valid non-trivial program to see if our parser works as expected.
The following program reads a byte $n$, and then outputs the first $n$ Fibonacci numbers:
let fib = parse(b"
,>>+<<
[->
[->>+<<]
>
[-<+>>+<]
>
[-<+>]
<<.<
]").unwrap();
execute_with_input(&fib, &[11]);
So everything seems to work fine 🙂
Maybe you have noticed that the functions parse
and parse_loop_body
are very similar. They only differences (except for the return type) are:
- what happens if a loop end character
]
is found (return "unexpected loop end" error or return parsed instructions and the remaining source after the loop), - what happens if there is no more input to parse (return parsed instructions or return "missing loop end" error).
Even though we will finally implement a new, much simpler parser with the nom crate, factoring out the common parts into a generic function, and parametrizing it with the desired "loop end" and "input end" behavior is a nice opportunity to learn how to pass functions as arguments in Rust functions.
This post is already very long though, so we will postpone this.
Summary¶
We have implemented a parser, which can transform the source code for any Brainfuck program into an abstract syntax tree. This abstract syntax tree can then be executed with the engine that we have developed in the previous posts in this series.
Along the way, we used a few interesting Rust concepts, such as iterators. We have also worked with the generic enum types Option
and Result
and found concise ways to work with them.
In the next posts, we will learn how to parse command line arguments and then turn our Brainfuck interpreter into an application that can be run from the command line. Moreover, we will simplify our parser and see that using the right crates makes parsing much easier than writing a parser from scratch.
-
We use byte slices, rather than strings, to represent Brainfuck source code here. This is why the literals are prefixed with
b
, which makes them byte string literals. Our parser works by consuming one character at a time, and it is easier to determine the remaining input if the consumed character always has the same size, which is one byte for byte slices. In strings, which are represented as UTF-8 in Rust, a character can correspond to a variable number of bytes.↩ -
In some languages, such as Java, all variables (except primitive types like
int
) can benull
. This is not quite the same as an optional value though, which is why Java 8 introduced the generic typeOptional<T>
.↩ -
This function is just used as a simple example for using the
?
operator withOption
. This particular task could be solved more easily with themap
method ofOption
.↩
Comments