Building a Compiler & Interpreter in Rust: Part I
Computers do not understand human languages directly; they operate on binary machine code. To bridge this gap, intermediaries such as compilers and interpreters convert human-readable code into machine-readable instructions. Programming languages, from low-level assembly up to high-level languages, were invented so that humans can write instructions for computers in a form that is easier to read and write.
Difference between compilers and interpreters:
A compiler translates the entire source code of a high-level programming language into machine code in one go. The system then stores the machine code and executes it. This approach is efficient at run time because the translation happens before the program runs. Examples of languages that are usually compiled include C, C++, and Rust.
An interpreter, by contrast, reads the high-level code and executes it statement by statement, translating each instruction just before it runs. This allows for dynamic execution, but it is usually slower than running compiled code. Examples of languages that are usually interpreted include Python, JavaScript, and PHP.
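The difference can be sketched in a few lines of Rust. The toy language below (two keywords, add and mul) is invented purely for illustration and is not MCL; the function names are assumptions as well. The interpreter-style function translates and executes one line at a time, while the compiler-style pair translates the whole program first and runs the result afterwards.

```rust
/// One operation of a hypothetical two-keyword toy language (not MCL).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Add(i64),
    Mul(i64),
}

/// Translates one source line such as "add 3" into an `Op`.
fn parse_line(line: &str) -> Result<Op, String> {
    let mut parts = line.split_whitespace();
    let keyword = parts.next().ok_or("empty line")?;
    let value = parts
        .next()
        .ok_or("missing operand")?
        .parse::<i64>()
        .map_err(|e| format!("bad operand: {}", e))?;
    match keyword {
        "add" => Ok(Op::Add(value)),
        "mul" => Ok(Op::Mul(value)),
        other => Err(format!("unknown keyword: {}", other)),
    }
}

/// Applies a single operation to the running value.
fn apply(op: Op, acc: i64) -> i64 {
    match op {
        Op::Add(n) => acc + n,
        Op::Mul(n) => acc * n,
    }
}

/// Interpreter style: translate each line and execute it immediately.
fn interpret(source: &str, start: i64) -> Result<i64, String> {
    let mut acc = start;
    for line in source.lines().filter(|l| !l.trim().is_empty()) {
        acc = apply(parse_line(line)?, acc);
    }
    Ok(acc)
}

/// Compiler style, step 1: translate the whole program up front.
fn compile(source: &str) -> Result<Vec<Op>, String> {
    source
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(parse_line)
        .collect()
}

/// Compiler style, step 2: run the already translated program.
fn run(program: &[Op], start: i64) -> i64 {
    program.iter().fold(start, |acc, &op| apply(op, acc))
}
```

With the source "add 2" followed by "mul 5", both styles turn a starting value of 1 into 15. The compiled program can be run again with a different starting value without re-reading the source, and compile rejects an unknown keyword before anything has executed.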
Building a compiler for the MCL language in Rust: general file structure and function
MCL is an invented language; you can design your own if you feel like it. The objective is to take the high-level code in a .mcl file and process it with the code in the compiler file. The compiler reads the keywords that match the predefined instructions, then converts the file to bytecode (an intermediate language). The interpreter contains a virtual machine (VM) that executes the bytecode, and the program's output is printed in the terminal in human-readable form.
Main file and general overview
Compiling and executing:
The compiler processes the .mcl file and converts it into an intermediate representation (the bytecode). The program takes its instructions from the terminal as command-line arguments, so the main file starts by importing the modules it needs. Together, the code provides a framework for compiling, executing, and handling bytecode. Breaking it down shows how each function contributes to processing the .mcl file.
Imports and Modules
The first part of the code imports the important libraries and modules:
// clap::Parser: Used for command-line argument parsing.
use clap::Parser;
// anyhow::Result: Simplifies error handling.
use anyhow::Result;
// log library: Provides logging functionality for different levels (e.g., info, debug, error).
use log::{debug, info, error};
// std utilities for file manipulation, path management, exiting the program, and writing to files.
use std::fs;
use std::path;
use std::process::exit;
use std::io::Write;
// Modules (compilers, interpreter, op, instr) define core functionality
// for the virtual machine, compilation, and bytecode handling.
// These are in separate files and are imported as modules into main.rs.
mod compilers;
mod interpreter;
mod op;
mod instr;
// Specific imports from the modules.
use crate::instr::Instr;
use crate::interpreter::{VM, decode_instructions, encode_instructions};
Argument Parsing with clap
The Args structure defines command-line arguments:
#[derive(Parser)]
struct Args {
    // --compile: Compiles source files.
    #[arg(long, default_value_t = false)]
    compile: bool,

    // --exec: Executes bytecode.
    #[arg(long, default_value_t = false)]
    exec: bool,

    // --optimize: Enables optimization (placeholder, unused here).
    #[arg(long, default_value_t = false)]
    optimize: bool,

    // --decompile: For decompiling bytecode (placeholder, unused here).
    #[arg(long, default_value_t = false)]
    decompile: bool,

    // --file: Specifies the file to process.
    #[arg(long)]
    file: Option<String>,
}
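To make the mapping from flags to fields concrete, here is a hand-written sketch that uses only the standard library. In the real program clap generates the parser; the Flags struct and the parse_flags function below are hypothetical stand-ins, and the file name prog.mcl in the usage note is made up.

```rust
/// Mirrors the article's `Args` struct, filled in by hand instead of by clap.
#[derive(Debug, Default, PartialEq)]
struct Flags {
    compile: bool,
    exec: bool,
    optimize: bool,
    decompile: bool,
    file: Option<String>,
}

/// Hypothetical stand-in for what `Args::parse()` does for these five flags.
/// `args` is the argument list without the program name.
fn parse_flags(args: &[&str]) -> Result<Flags, String> {
    let mut flags = Flags::default();
    let mut iter = args.iter();
    while let Some(&arg) = iter.next() {
        match arg {
            "--compile" => flags.compile = true,
            "--exec" => flags.exec = true,
            "--optimize" => flags.optimize = true,
            "--decompile" => flags.decompile = true,
            "--file" => {
                // The value of --file is the next argument.
                let value = iter.next().ok_or("--file needs a value")?;
                flags.file = Some(value.to_string());
            }
            other => return Err(format!("unexpected argument: {}", other)),
        }
    }
    Ok(flags)
}
```

Passing the arguments --compile --file prog.mcl yields a value with compile set to true, exec left at false, and file set to Some("prog.mcl"). Unlike clap, this sketch does not accept the --file=prog.mcl form and generates no help text.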
The Main Function
The main function calls run(), which holds the core logic shown in the following sections. If run() returns an error, main logs it and exits with status code 1:
fn main() {
    if let Err(err) = run() {
        error!("Error: {}", err);
        exit(1);
    }
}
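The pattern behind this split can be sketched without any external crates. The CliError type and the report function below are hypothetical; they stand in for anyhow's error type and for the body of main, so that the idea (the inner function returns errors, and one outer place turns them into a message and an exit status) can be checked in isolation.

```rust
use std::fmt;

/// Hypothetical error type standing in for `anyhow::Error`.
#[derive(Debug, PartialEq)]
enum CliError {
    MissingFile,
    UnsupportedExtension(String),
}

impl fmt::Display for CliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CliError::MissingFile => write!(f, "--file must be specified"),
            CliError::UnsupportedExtension(name) => {
                write!(f, "unsupported file extension: {}", name)
            }
        }
    }
}

/// Plays the role of `run()`: every failure is returned to the caller.
fn run(file: Option<&str>) -> Result<String, CliError> {
    // `?` returns early with the error if no file was given.
    let name = file.ok_or(CliError::MissingFile)?;
    if !name.ends_with(".mcl") {
        return Err(CliError::UnsupportedExtension(name.to_string()));
    }
    Ok(format!("would compile {}", name))
}

/// Plays the role of `main()`: one place maps the result to an exit
/// status and a message.
fn report(result: Result<String, CliError>) -> (i32, String) {
    match result {
        Ok(message) => (0, message),
        Err(err) => (1, format!("Error: {}", err)),
    }
}
```

Returning errors instead of exiting from deep inside the logic keeps a single exit point, which also makes the logic easier to test.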
Logging Setup
The logger is configured to format messages consistently in the terminal: each line starts with the log level, padded to a fixed width and coloured with ANSI escape codes, followed by the message.
env_logger::builder()
    .filter_level(log::LevelFilter::max())
    .format(|buf, record| {
        let level = match record.level() {
            log::Level::Error => "\x1b[31mERROR\x1b[0m", // Red
            log::Level::Warn => "\x1b[33mWARN\x1b[0m", // Yellow
            log::Level::Info => "\x1b[32mINFO\x1b[0m", // Green
            log::Level::Debug => "\x1b[34mDEBUG\x1b[0m", // Blue
            log::Level::Trace => "\x1b[35mTRACE\x1b[0m", // Magenta
        };
        writeln!(buf, "{:14} | {}", level, record.args())
    })
    .init();
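The width of 14 in the format string is worth a closer look, because the padding counts the invisible escape characters too. The following sketch reproduces the formatting with plain strings. The helper names colored, format_line, and visible are assumptions made for this example and are not part of env_logger.

```rust
/// Returns the coloured level label used by the article's logger format.
fn colored(level: &str) -> String {
    let color = match level {
        "ERROR" => 31, // red
        "WARN" => 33,  // yellow
        "INFO" => 32,  // green
        "DEBUG" => 34, // blue
        _ => 35,       // magenta, used for TRACE
    };
    format!("\x1b[{}m{}\x1b[0m", color, level)
}

/// Formats one log line with the same format string as the logger closure.
fn format_line(level: &str, message: &str) -> String {
    format!("{:14} | {}", colored(level), message)
}

/// Strips ANSI colour sequences so the visible text can be inspected.
fn visible(text: &str) -> String {
    let mut out = String::new();
    let mut chars = text.chars();
    while let Some(c) = chars.next() {
        if c == '\x1b' {
            // Skip everything up to and including the terminating 'm'.
            for skipped in chars.by_ref() {
                if skipped == 'm' {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

A five-letter level such as ERROR is exactly 14 characters long once the nine escape characters are included, so it gets no padding. A four-letter level such as INFO is 13 characters long and receives one space, which lines up the separator for all five levels.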
Core Logic
a. Compile
This section compiles the input file into bytecode if the --compile flag is set:
let args = Args::parse();
let mut bytecode = vec![];

// Compile
if args.compile {
    if let Some(filename) = args.file.as_deref() {
        if path::Path::new(filename).extension() == Some(std::ffi::OsStr::new("mcl")) {
            bytecode = compilers::compile(&fs::read_to_string(filename)?)?
                .into_iter()
                .map(|instr| instr.to_u64())
                .collect();
        } else {
            eprintln!("Error: Unsupported file extension for compilation.");
            exit(1);
        }
    } else {
        error!("Error: --file must be specified when using --compile");
        exit(1);
    }
}
This is where main.rs calls into the compilers module, which you will come across in the next part of the series. The code first checks that the input file has the .mcl extension. It then reads the file, compiles it with the compilers::compile function, and converts each resulting instruction into a u64 so that the bytecode is a list of numbers the interpreter can process.
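The article has not yet shown the instr module, so the following is only a guess at how an instruction could be turned into a u64 and back. The four instructions, their opcode numbers, and the bit layout (opcode in the high 32 bits, operand in the low 32 bits) are all assumptions; only the method names to_u64 and from_u64 come from the code above.

```rust
/// Hypothetical instruction set; the article's real `Instr` lives in
/// instr.rs and is not shown here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Instr {
    Push(u32),
    Add,
    Print,
    Halt,
}

impl Instr {
    /// Packs the opcode into the high 32 bits and the operand into the
    /// low 32 bits (an assumed layout).
    fn to_u64(self) -> u64 {
        let (opcode, operand): (u64, u32) = match self {
            Instr::Push(value) => (1, value),
            Instr::Add => (2, 0),
            Instr::Print => (3, 0),
            Instr::Halt => (4, 0),
        };
        (opcode << 32) | u64::from(operand)
    }

    /// Reverses `to_u64`. Unknown opcodes decode as `Halt` in this sketch.
    fn from_u64(word: u64) -> Instr {
        let operand = (word & 0xFFFF_FFFF) as u32;
        match word >> 32 {
            1 => Instr::Push(operand),
            2 => Instr::Add,
            3 => Instr::Print,
            _ => Instr::Halt,
        }
    }
}

/// Mirrors the `.into_iter().map(|instr| instr.to_u64()).collect()` step.
fn to_bytecode(instrs: Vec<Instr>) -> Vec<u64> {
    instrs.into_iter().map(|instr| instr.to_u64()).collect()
}
```

Because every instruction becomes a fixed-size number, a whole program is just a Vec<u64>, and mapping Instr::from_u64 over that vector recovers the original instructions.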
b. Execute
If --exec is specified, the program executes the bytecode:
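// Execute
if args.exec {
    if !args.compile {
        if let Some(file) = args.file.as_deref() {
            if path::Path::new(file).extension() == Some(std::ffi::OsStr::new("mcl")) {
                error!("Error: Cannot execute an uncompiled file.");
                exit(1);
            }
            // fs::read returns raw bytes, so they are decoded into instructions first.
            // Assumes decode_instructions takes the bytes and returns the decoded
            // Vec<Instr>, mirroring encode_instructions.
            bytecode = decode_instructions(&fs::read(file)?)?
                .into_iter()
                .map(|instr| instr.to_u64())
                .collect();
        } else {
            eprintln!("Error: --file must be specified.");
            exit(1);
        }
    }
    let mut vm = VM::new();
    vm.execute(bytecode)?;
}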
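Unless --compile was also passed, the program loads previously compiled bytecode from the file and refuses to run a raw .mcl source file. It then executes the bytecode in a virtual machine (VM) using the vm.execute method.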
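The interpreter module is covered later in the series, so the following is only a minimal sketch of what a VM with new and execute methods might do. It is a small stack machine with four made-up opcodes, and it assumes each u64 carries the opcode in its high 32 bits and an operand in its low 32 bits. Output is collected in a vector rather than printed so that it can be inspected.

```rust
/// Hypothetical opcodes; the article's real encoding is defined in its
/// op and instr modules.
const PUSH: u64 = 1;
const ADD: u64 = 2;
const PRINT: u64 = 3;
const HALT: u64 = 4;

/// A tiny stack machine.
struct VM {
    stack: Vec<i64>,
    output: Vec<String>,
}

impl VM {
    fn new() -> Self {
        VM { stack: Vec::new(), output: Vec::new() }
    }

    /// Runs each 64-bit word in order until HALT or the end of the program.
    fn execute(&mut self, bytecode: Vec<u64>) -> Result<(), String> {
        for word in bytecode {
            let operand = (word & 0xFFFF_FFFF) as i64;
            match word >> 32 {
                PUSH => self.stack.push(operand),
                ADD => {
                    let b = self.pop()?;
                    let a = self.pop()?;
                    self.stack.push(a + b);
                }
                PRINT => {
                    let top = self.pop()?;
                    self.output.push(top.to_string());
                }
                HALT => break,
                other => return Err(format!("unknown opcode: {}", other)),
            }
        }
        Ok(())
    }

    /// Pops the top of the stack, or reports an underflow.
    fn pop(&mut self) -> Result<i64, String> {
        self.stack.pop().ok_or_else(|| "stack underflow".to_string())
    }
}
```

Executing the words for push 2, push 3, add, print, halt leaves a single entry, "5", in the output. An add on an empty stack or an unknown opcode returns an error, which the caller can propagate with the ? operator in the same way the article's code does.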
c. Save Bytecode
If execution is not required, the bytecode is saved to a file:
else {
    // Save bytecode to a file
    if let Some(file) = args.file.as_deref() {
        let instrs: Vec<Instr> = bytecode.iter().map(|&b| Instr::from_u64(b)).collect();
        let mut new_file = fs::File::create(path::Path::new(file).with_extension("mclb"))?;
        new_file.write_all(encode_instructions(&instrs)?.as_slice())?;
    } else {
        error!("Error: --file must be specified for output.");
        exit(1);
    }
}
Ok(())
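Each u64 value is converted back into an instruction with Instr::from_u64, and encode_instructions serializes the instructions to bytes. The result is written to a file with the same name as the input and the .mclb extension, which is the compiler's output.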
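The two file-related ideas in this step, choosing the output name and turning bytecode into bytes, can be sketched with the standard library alone. The path helpers use the same Path methods as the code above. The byte format (each u64 stored as eight little-endian bytes) and the names encode_words and decode_words are assumptions, since the real encode_instructions and decode_instructions have not been shown.

```rust
use std::path::{Path, PathBuf};

/// True when the path has the .mcl source extension.
fn is_mcl_source(file: &str) -> bool {
    Path::new(file).extension().and_then(|ext| ext.to_str()) == Some("mcl")
}

/// Same call the article uses to name the output file.
fn output_path(file: &str) -> PathBuf {
    Path::new(file).with_extension("mclb")
}

/// Assumed on-disk format: each 64-bit word as 8 little-endian bytes.
fn encode_words(words: &[u64]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(words.len() * 8);
    for word in words {
        bytes.extend_from_slice(&word.to_le_bytes());
    }
    bytes
}

/// Reverses `encode_words`, rejecting input whose length is not a
/// multiple of 8.
fn decode_words(bytes: &[u8]) -> Result<Vec<u64>, String> {
    if bytes.len() % 8 != 0 {
        return Err(format!("truncated bytecode: {} bytes", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect())
}
```

With a made-up input name of prog.mcl, output_path returns prog.mclb. Fixing the byte order explicitly means a bytecode file written on one machine decodes to the same words on another.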
Conclusion.
In this article, you learned about the components involved in building a compiler and interpreter in Rust. You now know the difference between compiled and interpreted languages, as well as the general architecture of the compiler project, which consists primarily of the modules compilers, interpreter, op, and instr. You will learn about these modules in future parts of the series.
If you are interested in the video version of this article, check out