Learn Rust by building a Brainfuck interpreter, part 1: implement the tape and model the program as an abstract syntax tree
Motivation¶
For many years, C++ used to be my favourite language for performance-critical applications. I find Python much more convenient for many use cases, but I have always enjoyed employing zero-cost abstractions and writing generic code that can be compiled to optimal assembly.
However, C++ has a number of downsides, such as
- problems that arise from undefined bahaviour in the code,1
- all the baggage it carries due to its backwards compatibility with almost all C++ code and much of the C code that has ever been written. A consequence is that C++ is more cumbersome and error-prone to use than more modern languages, and that developers are less productive than they could be.
Rust tries to address these issues and enable excellent performance at the same time, so I decided that I should give it a try a few years ago. I started playing around with it for Advent of Code problems, enjoyed it a lot, and wondered what practice project I could try next.
A non-trivial practive project for any programming language is to implement an interpreter. Brainfuck is the simplest language that I can think of, so a Brainfuck interpreter is what I implemented some time ago. Recently, I've improved it a bit and decided to write down the steps that I took.
Readers who have some experience with Rust will find this post easier to follow, but it does not assume any prior Rust knowledge. It does not aim to explain every concept in great detail though. If you want to know more, you can find a good introduction in The Rust Programming Language.
Brainfuck¶
You can read a lot about Brainfuck in its Wikipedia article. Its most important aspects are:
-
Brainfuck code operates on memory which is organized as a one-dimensional array of byte cells. I like to call it tape because it resembles the tape of a Turing machine.
Besides the tape itself, there is a data pointer which points to the current cell.
-
There are eight instructions, which are represented by a single character each.
instruction effect - decrement the byte inside the current cell by one + increment the byte inside the current cell by one < decrement the data pointer by one, such that it points one cell further to the left > increment the data pointer by one, such that it points one cell further to the right . output the byte in the current cell to standard output , read a byte from standard input and store it in the current cell [ loop start: if the byte in the current cell is zero, jump to the instruction after the matching ']' instruction, otherwise, go to the next instruction ] loop end: if the byte in the current cell is non-zero, jump to the instruction after the matching '[' instruction, otherwise, go to the next instruction All other characters are considered comments and are ignored during execution.
Loops which would correspond to
while (current_cell_value() != 0) { /* loop body: instructions between [ and ] */ }
in C-like languages can be built with '
[
' and ']
'. A '[
' and a ']
' instruction match just like parentheses would in an arithmetic expression, so you can build nested loops with these instructions. -
Brainfuck is Turing complete: it can in principle be used to write programs that perform any computation that you can think of (at least if the tape has infinite size). However, it is not very practical because most programs tend to be very long and convuluted. This makes Brainfuck a Turing tarpit.
How you can experiment with the code in this post¶
Like many of my posts, this post is a Jupyter notebook, which you can download, execute, and modify. You just need Jupyter, Rust, and the evcxr_jupyter crate. You can also work with the notebook easily in Binder (it might take a while to start up):
Modeling the tape in Rust¶
The original Brainfuck specification uses a finite tape with 30,000 cells. Each cell holds a single byte, which is initialized to zero. Moreover, there is a movable data pointer which points to the leftmost cell initially.
We could model this with an array with 30,000 bytes:2
struct ArrayTape {
data: [u8; 30000],
pos: usize
}
We can create a tape with all cells initialized to zero like this:
let my_tape = ArrayTape { data: [0; 30000], pos: 0 };
It is common to add a method new()
to a type, which creates a new instance of this type:
impl ArrayTape {
fn new() -> Self {
Self {
data: [0; 30000],
pos: 0
}
}
}
let my_tape_2 = ArrayTape::new();
Inside the impl ArrayTape { ... }
block, Self
is synonymous to ArrayTape
, so it can be used as the return type of new()
(after ->
) and inside the function. Note that no return
statement is needed: If no return
is encountered while executing a function, the final expression, which is not terminated with a semicolon, is returned.
We can make our Brainfuck interpreter more flexible by using a Vec
, which is essentially a dynamic resizable array. Then we can handle more than 30,000 cells, and we do not have to allocate space for and initialize cells which we might not need at all during execution: we will just add a new zero at the end whenever the pointer moves off the tape to the right. Initially, we will use just a single cell and initialize the Vec
with the vec!
macro:
struct VecTape {
data: Vec<u8>,
pos: usize
}
impl VecTape {
fn new() -> Self {
Self{
data: vec![0],
pos: 0
}
}
}
But we will use an even more flexible solution here, namely, a VecDeque
, which is a double-ended queue that allows efficient addition of items at either end. This enables programs which move the data pointer to the left from the initial position.
Moreover, we will use the macro #[derive(Debug)]
, which makes it possible to output the state of a tape easily, as we will see later:
use std::collections::VecDeque;
#[derive(Debug)]
struct Tape {
data: VecDeque<u8>,
pos: usize
}
impl Tape {
fn new() -> Self {
Self{
data: VecDeque::from([0]),
pos: 0
}
}
}
Now we will implement some functions that operate on a Tape
to make working with it convenient. First, we would like to read and write the value of the current cell. Unlike new()
, which does not operate on an existing Tape
instance (in C++, this would be indicated with the static
keyword), the next functions take a tape by reference and by mutable reference, respectively:3
impl Tape {
fn get(&self) -> u8 {
self.data[self.pos]
}
fn set(&mut self, value: u8) {
self.data[self.pos] = value;
}
}
We can test that this works as expected. Note how we have to declare the variable as mutable because immutability is the default in Rust:4
let mut t1 = Tape::new();
t1.get()
t1.set(5);
t1.get()
In a unit test, we would not want to look at the output, but prefer to have the expected results verified automatically. This can be done like this:
let mut t2 = Tape::new();
assert_eq!(t2.get(), 0);
t2.set(17);
assert_eq!(t2.get(), 17);
If such a check fails, the test will panic and output details about which assertion failed and what the expected and actual results were:
assert_eq!(t2.get(), 1);
So far, we can only read and write the value of the current cell. To model the entire Brainfuck instruction set, we also need functions that move the data pointer:
impl Tape {
fn right(&mut self) {
self.pos += 1;
if self.pos == self.data.len() {
// The data pointer is moving off the tape:
// add a new zero-valued cell at the back.
self.data.push_back(0);
}
}
fn left(&mut self) {
if self.pos > 0 {
// We have not reached the leftmost cell yet.
// Just move the data pointer to the left.
self.pos -= 1
} else {
// If self.pos is 0, the data pointer points to the leftmost cell.
// Add a new cell at the front and leave the pointer as it is.
self.data.push_front(0);
}
}
}
Let's try it, and then print the state of the tape. This can be done easily because we have used #[derive(Debug)]
when defining the Tape
:
let mut t3 = Tape::new();
t3.set(2);
t3.right();
t3.set(3);
t3.left();
t3.left();
t3.set(1);
t3
How to model Brainfuck instructions: Rust enums
are more than just named numbers¶
Now that we have a data structure for the tape, it's time to model the instructions. We will use an enum
for that.
It might be tempting to use a definition like this:
enum InstructionFirstTry {
Inc,
Dec,
Left,
Right,
Read,
Write,
LoopStart,
LoopEnd
}
This looks pretty much like an enum
in C/C++, where the values are mostly synonyms for specific integer values.
However, even though we have not started to implement the parser and the execution engine of the interpreter yet, we can already sense that this definition
- will make parsing easy, because each character in the set
+-<>.,[]
maps to exactly one instruction, but - leaves the task of finding matching
LoopStart
andLoopEnd
instructions to the execution engine, which will thus need to be more complex.
Here we will do it in a different way and parse the source code into an abstract syntax tree, which makes execution easy, but requires a bit more work in the parsing step. This design choice is often the more sensible one because
- usually, code is executed a lot more often than it is parsed, so it makes sense to move complex and possibly slow operations like loop delimiter matching to the parsing step,
- matching loop delimiters while parsing ensures that we can reject invalid code immediately, i.e., code where the
[
and]
instructions do not match, and - we will see later that a lot of the additional parsing complexity can be delegated to powerful libraries, such as
nom
.
How can we implement an abstract syntax tree for Brainfuck in Rust?
It turns out that Rust's enum
s can do more than just model integer constants: each variant of the enum
can hold additional data, so we can define an instruction Loop
that contains a Vec
with the instructions in the loop body:
#[derive(Debug)]
enum Instruction {
Inc,
Dec,
Left,
Right,
Read,
Write,
Loop(Vec<Instruction>)
}
In fact, Rust's enum
s are algebraic data types which were pioneered by functional programming languages like Haskell. In principle, C++ has std::variant
, which does have similar functionality. However, unlike C++, Haskell and Rust provide pattern matching,5 which is a very powerful and convenient way to interact with these data types with very little code, as we will se in the next post in this series.
To use an Instruction
variant, we have to prefix its name with Instruction::
:
let myInstruction = Instruction::Inc;
Often it is more convenient if we can omit the Instruction::
prefix. This can be achieved in this way:
use Instruction::*;
This is a valid loop instruction:
let loopInstruction = Loop(vec![Dec, Right, Inc, Left]);
And this is a full Brainfuck program that reads a byte, and then writes all numbers between 1 and the input byte:
let program = vec![Right, Inc, Left, Read, Loop(vec![Right, Write, Inc, Left, Dec])];
Since Instruction
implements the Debug
trait, we can also print the program easily:
program
Outside a Jupyter notebook, we would do this with the println!
macro:
println!("{:?}", program);
Note that the format specifier ':?
' inside the braces tells that the automatically implemented Debug
trait shall be used to print the value. The instruction
println!("{}", program)
would work only if we implemented the Display
trait for Instruction
. Then we would have to write some code, but we would be free to choose what a printed instruction should look like.
There is also a pretty-printing option for types that implement Debug
, which makes the tree structure of a Brainfuck program clearer:
println!("{:#?}", program);
Summary and outlook¶
We have modeled a tape for our Brainfuck interpreter with a struct
and implemented the operations that we need. Then we used an enum
to model Brainfuck instructions as an abstract syntax tree.
In the next post in this series, we will use these building blocks to implement an execution engine that applies the instructions to the state of the tape.
-
It is the programmer's responsibility to avoid the following in C and C++, and compilers can only provide very limited help with that:
- reading and writing out of the bounds of an array, an
std::vector
, or similar data structures, - dereferencing dangling pointers (or references), i.e., pointers that point to data that are not valid any more,
- reading and writing data from different threads at the same time,
- and much more, such as incrementing an
int
that holds the value 2147483647 a.k.aINT_MAX
by one, or dereferencing a null pointer, even if the value is never used and the dereference is optimized away completely.
All these cause undefined bahaviour, which means that basically anything can happen. The application might work as the developer expected, or it could yield incorrect results, or it could crash, or it could have a critical security vulnerability. Which of these you get can change at any time, e.g., when changing the optimization level or other compiler options, upgrading to a new compiler version, or modifying something in a different part of the code base.
A blog series that Chris Lattner wrote in 2011 on undefined behaviour (part 1, part 2, part 3) should still be mandatory reading for anyone who writes code in C and C++.↩
- reading and writing out of the bounds of an array, an
-
Note how Rust integer types make their bit width and their signedness explicit:
- signed integer types are
i8
,i16
,i32
,i64
,i128
, andisize
, - unsigned integer types are
u8
,u16
,u32
,u64
,u128
, andusize
.
The size of
isize
andusize
depends on the platform: 32 bits on 32-bit systems, and 64 bits on 64-bit systems. They correspond tosize_t
andssize_t
in C.↩ - signed integer types are
-
Note that
self
must be stated explicitly as a function parameter, and that access to struct members must also be qualified withself.
, just like in some other languages like, e.g., Python.↩ -
Note that the cell output is the final expression without trailing semicolon in the cell. This corresponds to the fact that functions return the final expression without trailing semicolon if there is no return statement.↩
-
One could argue that
std::visit
can be used as a workaround for pattern matching in C++. This is true, but using it feels more cumbersome and requires more code than proper pattern matching. There is a library which adds pattern matching to C++ and which might evolve into something that will be added to the C++ standard. Compared to pattern matching in, e.g., Haskell and Rust, it still feels a bit unwieldy though.↩
Comments