In this course, you will first design a language and subsequently implement a complete compiler for this language. The entire course has been designed around this singular purpose, and my primary function is to help you accomplish this task.
The design of the language is mostly up to you. However, to help ensure that the class is generally on the same page and implementing generally the same things at the same time, it's required that your language have certain attributes and support certain kinds of features. Details of these required attributes and features are below.
int x = 7; String s = x; // int cannot convert to StringStatic typing is likely the most controversial required feature, as it means you cannot implement a typical dynamically-typed language (e.g., Python, Ruby, PHP). I'm intentionally requiring typechecking because it's useful to know how the compiler does this when debugging type errors in any language, and to give a sense of why certain limitations exist around types.
1 + 2
is an expression that evaluates to 3
, as is foo("bar") + "baz"
.
Not everything in your language needs to be an expression (e.g., int x = 7
is not an expression in Java, but rather a statement), and it's ok if most things in your language are not expressions (e.g., Prolog's is
built-in treats its second argument as an expression, and it's the only construct in the language that works with expressions).
if
and while
are both common control structures.
If you want your language to be Turing-complete, you must permit either loops or recursion, if not both.
If you don't want your language to be Turing-complete, you must have a very good argument why, based on how you're planning the language to be used (e.g., “it's one less thing to implement” is not a valid reason, but ”it's important that programs in this language always terminate” can be legitimate).
Essentially all languages have control structures; even Prolog has conditional execution, even though it lacks the usual if
and while
.
5
as the first parameter of the function), which is limiting.
def addThis(x: Int): Int => Int = (y: Int) => x + yThe
addThis
function takes a parameter x
, and returns a function.
The function returned itself takes a parameter y
, and itself returns the value of x + y
.
Internally, when addThis
is called, it creates a closure which holds a function that does x + y
, along with the specific value of x
.
It's possible to represent something similar with objects and methods, for example, in Java:
public class AddThisHelper { private final int x; public AddThisHelper(final int x) { this.x = x; } public int doCall(final int y) { return x + y; } // roughly equivalent to Scala's addThis public static AddThisHelper addThis(final int x) { return new AddThisHelper(x); } }However, this is clearly a lot bulkier. While higher-order functions are best known from their use in functional programming languages (e.g., Lisp, Scheme, Racket, Haskell, Scala), most modern languages support them (e.g., JavaScript, C++11, Java 8, Go).
class HasId a where getId :: a -> Int instance HasId String where getId input_string = length input_stringThe above code defines the
HasId
typeclass, which operates over some generic type a
.
This typeclass is associated with the getId
function, which will take a value of type a
and return an Int
representing its id.
We then say that the String
type implements the HasId
typeclass, and define getId
for String
to return the length of the input string.
In Java parlance, the above code is very similar to the following:
public interface HasId { public int getId(); } public class String implements HasId { public int getId() { return length(); } }However, the above Java code doesn't work, since
String
is already defined for you.
As such, you cannot tack-on a method to it like we have.
Typeclasses, in contrast, allow us to effectively add methods to arbitrary types whenever we want.
Both Rust and Haskell use typeclasses.
A non-exhaustive list of example features is provided here.
So far the discussion has focused on the input to your compiler, namely, programs written in a language of your design. In this section, we discuss the output of your compiler, namely, a (hopefully!) equivalent program written in a (possibly) different language. Details follow.
First of all, you have complete control over your target language. You can compile all the way to assembly if you'd like. You can compile to a low-level representation above assembly, such as LLVM bitcode (which itself compiles to assembly), or Java Virtual Machine (JVM) bytecode (which runs on the JVM, like Java). You can even compile to another high-level language, like Java, C, or JavaScript (often called transpilation).
The fact that I do not require you to compile to assembly may seem strange, given that much content out there is about compiling to assembly. The primary reason why I am doing this is because I want this class to focus on modern compilers. Compiling directly to assembly is nowadays somewhat strange. If assembly is ultimately desired, it's common practice to use LLVM as a target instead, which itself can compile to assembly. Both Clang (a C compiler) and Rust compile to LLVM bitcode. LLVM is a much simpler target than assembly, and it will automatically do certain optimizations for you. LLVM can be customized to do language-specific optimizations, and LLVM can compile to a wide variety of assembly languages. As such, from a modern compiler standpoint, LLVM is almost certainly better than a custom approach. The only downside to selecting LLVM, from an educational standpoint, is that you won't have as good an understanding of how a specific target machine works (but you can still select assembly if that's important to you).
Similarly, the fact that I allow you to compile to a high-level language may seem strange. This is permitted because there has been an influx of languages and compilers that do exactly this. For example, Mercury compiles to C, TypeScript, CoffeeScript, and Dart are all custom languages which compile to JavaScript, and a variety of languages can be compiled to JavaScript thanks to special compilers (e.g., Emscripten for C, Scala.js for Scala, and ClojureScript for Clojure). (JavaScript is a popular target, given its ability to run natively in web browsers.) As such, compiling to a high-level language is not very strange at all. Moreover, this can simplify a number of basic things (many things become trivial), allowing you to experiment with more complex language features.
While you can choose any target language you want, your choice will have a dramatic impact on development, and will largely determine how difficult it will be to implement different parts of your language. For example, if your target language is JavaScript, implementing features that JavaScript already has (e.g., expressions, higher-order functions) will be a relatively trivial, likely one-to-one mapping. Conversely, if your target language is assembly, implementing even basic expressions is non-trivial. In order to keep the difficulty level of the various projects roughly the same, the following requirements and guidelines are in effect:
for
loops are trivial to implement if your target language already has while
loops.
String
in Java is a subtype of Object
.
auto
“type” uses a limited form of type inference, and type inference is used in a variety of modern languages (e.g., Haskell, Scala, Swift).
Certain programming language features are principally features of the language runtime system. For example, garbage collection and just-in-time compilation are two major runtime system features. The JVM and CLR are two examples of common runtime systems. This is a compiler-oriented class, not a runtime systems class; the distinction is that compilers generate code, and runtime systems execute that code. We could easily spend an entire class just on runtime systems. As such, if you want your language to have a feature which is classically supported by the runtime system, you should select a target language with the same feature. For example, if you want your language to have garbage collection, you should select a target that already supports garbage collection (e.g., JVM, CLR, JavaScript). Implementing a custom runtime system is a massive undertaking in and of itself, and you will not have enough time in this class to implement both a language and a runtime system (speaking from experience).
If you're not sure if a feature is a runtime system feature or not, just ask!
You can select whatever language you like for implementing the compiler itself. However, certain implementation languages will be much more difficult to work with than others, and there is absolutely no benefit to making your life harder than it needs to be here. All else being equal, I'd recommend a high-level language which supports pattern matching. Pattern matching allows you to very directly do the following:
This sounds a lot like the more traditional switch
, and they are related.
However, pattern matching gives you some extra functionality, namely:
value match { case Plus(e1, e2) => ... ... }The intention with this code is that it checks to see if an input is of the form
e1 + e2
, and if so, will do something accordingly.
In contrast, this code with switch
would be something like the following (in JavaScript):
switch (value.id) { case PLUS: var e1 = value.e1; var e2 = value.e2; ... }In JavaScript, separate statements were needed to extract out
value.e1
and value.e2
.
value match { case Plus(IntValue(0), e) => ... ... }The above code specifically looks for
0 + e
.
With switch
, the above code would be more like the following (in JavaScript):
switch (value.id) { case PLUS: if (value.e1.id == INT_VALUE && value.e1.int_value == 0) { var e = value.e2; ... break; } ...This can be very important for making things like peephole optimizations concise.
It may be worth looking into a high-level language with pattern matching, if you're not already familiar with one. If you're used to Java, I'd personally recommend looking at Scala, which runs on the JVM, interoperates easily with Java, and tends not to get in your way when you're first learning it. You may also be interested in OCaml, especially if you're interested in using LLVM bitcode as a target language (LLVM has native OCaml bindings).
All that said, diving into a new language is itself a big risk, so you may be better off suffering through some unclean code.
If you're using an object-oriented language for implementation, you may be interested in the visitor design pattern, which can handle certain features of pattern matching (specifically, making sure all cases are covered).
However, from my personal experience, the visitor design pattern is usually more trouble than it's worth, and I usually will do something like switch
and will potentially cast objects frequently.
This isn't considered good practice, but the fundamental problem is in the implementation language itself fighting you unnecessarily.
Given that every project is different (perhaps radically so), grading becomes difficult. As such, I'm effectively offloading much of this responsibility onto you. It is your responsibility to demonstrate to me that your code works. The most effective way of doing this is via unit tests (to test individual components), and integration tests to make sure components work together correctly. If your tests test a wide variety of circumstances, this is good. It's even better if your tests have high code coverage (measured with a code coverage tool). Code without tests is unlikely to be evaluated well, especially if the code is difficult to follow.
Each project is expected to be a massive undertaking, and the difficulty level has been designed with the assumption that each project will be backed by 2-4 people. To be clear, each project will require you to do the following:
None of this is expected to be easy. From my own recent experience, it took me eight weeks to solo implement a naive, non-optimizing compiler from Prolog to C. From another project, I already had a lexer, parser, and typechecker available. Matching my own development timeline to the course timeline, this development wasn't fast enough; I would have missed deadlines. The point: you almost assurredly cannot do this alone.
Working with teams is a scary prospect, given the classic problem of unbalanced workloads between team members. Using the team-oriented nature of this class to coast along on someone else's effort is not acceptable. To make sure people aren't coasting, I will be doing two things for each deadline:
We will be using GitHub for all code development, and I will measure the number of code contributions (lines added/edited/removed) that each team member has performed across all branches. If I see that someone's contribution count is significantly lower or higher than everyone else's, and no explanation is in the peer evaluations, I'll investigate. There are certain benign situations where this can happen; e.g., someone starts reading a lot of documentation on a target language, and diverts time away from coding with the team's consent. If you have far less code without explanation, you, and only you, will not receive credit for the deadline. Conversely, if you have far more code without explanation, I will contact you to ask what's going on.
The reason for doing both peer evaluations and GitHub monitoring is to avoid situations where someone feels pressured to give someone a good peer evaluation, and similarly to avoid situations where it's one person's word against another's. Peer evaluations are easy to spoof, but code contributions are not.
Now that you're familiar with the ground rules for the proposal, along with related caveats, the information about the proposal itself follows. The template you should use is here. Two example proposals have been provided, listed below:
These examples are provided to give you an idea of what you can do. If you really want to use one of these instead of developing your own, talk to me first.
You must provide documentation for your language. You have total control over the format of this documentation; for example, you could create a website for it with links to different parts (GitHub has free web hosting and makes this relatively easy), or create a more traditional document. No matter the format you use, your documentation must cover the following:
Your documentation can include more information than the above if you so choose, but it must cover the above elements.
In terms of the size, there are no hard minimum or maximum sizes. That said, I would expect that you'd need at least two pages to cover everything, and that you're probably going into too much detail if you need more than ~20 pages.
Internally, I will use this documentation to assist in the grading process. The documentation is your opportunity to guide me through what you've done, and to point out both the good and the bad. It's easy to put documentation off and to do a subpar job of it, but keep in mind that the documentation is worth a full 9% of your grade. This is more than the tokenizer and parser combined! I intentionally have this scored this way to incentivize you to spend time on the documentation. If you'd like early feedback on the documentation, just let me know and I'm happy to look at early drafts without grading anything.
Your group must present your language during the final exam slot (Wednesday, May 13 from 8:00 AM - 10:00 AM). This presentation overlaps with some of your documentation. This presentation must cover the following:
Your presentation may cover additional content, if you wish.
As for the format of your presentation, the only hard constraint is that you will be given 9 minutes max, followed by ~3 minutes of Q&A. Some related details follow:
The overall purpose of the presentation is to show off what you've done to the class, and to give an opportunity for us to ask questions directly before final grading.