Project Information

In this course, you will first design a language and subsequently implement a complete compiler for this language. The entire course has been designed around this singular purpose, and my primary function is to help you accomplish this task.

Language Design

The design of the language is mostly up to you. However, to help ensure that the class is generally on the same page and implementing generally the same things at the same time, it's required that your language have certain attributes and support certain kinds of features. Details of these required attributes and features are below.

A non-exhaustive list of example features is provided here.

Target Language

So far the discussion has focused on the input to your compiler, namely, programs written in a language of your design. In this section, we discuss the output of your compiler, namely, a (hopefully!) equivalent program written in a (possibly) different language. Details follow.

First of all, you have complete control over your target language. You can compile all the way to assembly if you'd like. You can compile to a low-level representation above assembly, such as LLVM bitcode (which itself compiles to assembly), or Java Virtual Machine (JVM) bytecode (which runs on the JVM, like Java). You can even compile to another high-level language, like Java, C, or JavaScript (often called transpilation).

The fact that I do not require you to compile to assembly may seem strange, given that much content out there is about compiling to assembly. The primary reason why I am doing this is because I want this class to focus on modern compilers. Compiling directly to assembly is nowadays somewhat strange. If assembly is ultimately desired, it's common practice to use LLVM as a target instead, which itself can compile to assembly. Both Clang (a C compiler) and Rust compile to LLVM bitcode. LLVM is a much simpler target than assembly, and it will automatically do certain optimizations for you. LLVM can be customized to do language-specific optimizations, and LLVM can compile to a wide variety of assembly languages. As such, from a modern compiler standpoint, LLVM is almost certainly better than a custom approach. The only downside to selecting LLVM, from an educational standpoint, is that you won't have as good an understanding of how a specific target machine works (but you can still select assembly if that's important to you).

Similarly, the fact that I allow you to compile to a high-level language may seem strange. This is permitted because there has been an influx of languages and compilers that do exactly this. For example, Mercury compiles to C, TypeScript, CoffeeScript, and Dart are all custom languages which compile to JavaScript, and a variety of languages can be compiled to JavaScript thanks to special compilers (e.g., Emscripten for C, Scala.js for Scala, and ClojureScript for Clojure). (JavaScript is a popular target, given its ability to run natively in web browsers.) As such, compiling to a high-level language is not very strange at all. Moreover, this can simplify a number of basic things (many things become trivial), allowing you to experiment with more complex language features.

While you can choose any target language you want, your choice will have a dramatic impact on development, and will largely determine how difficult it will be to implement different parts of your language. For example, if your target language is JavaScript, implementing features that JavaScript already has (e.g., expressions, higher-order functions) will be a relatively trivial, likely one-to-one mapping. Conversely, if your target language is assembly, implementing even basic expressions is non-trivial. In order to keep the difficulty level of the various projects roughly the same, the following requirements and guidelines are in effect:

Special Note About Language Runtime and Feature Interaction

Certain programming language features are principally features of the language runtime system. For example, garbage collection and just-in-time compilation are two major runtime system features. The JVM and CLR are two examples of common runtime systems. This is a compiler-oriented class, not a runtime systems class; the distinction is that compilers generate code, and runtime systems execute that code. We could easily spend an entire class just on runtime systems. As such, if you want your language to have a feature which is classically supported by the runtime system, you should select a target language with the same feature. For example, if you want your language to have garbage collection, you should select a target that already supports garbage collection (e.g., JVM, CLR, JavaScript). Implementing a custom runtime system is a massive undertaking in and of itself, and you will not have enough time in this class to implement both a language and a runtime system (speaking from experience).

If you're not sure if a feature is a runtime system feature or not, just ask!

Compiler Language

You can select whatever language you like for implementing the compiler itself. However, certain implementation languages will be much more difficult to work with than others, and there is absolutely no benefit to making your life harder than it needs to be here. All else being equal, I'd recommend a high-level language which supports pattern matching. Pattern matching allows you to very directly do the following:

  1. Look at some input value
  2. If it matches this pattern, do this thing
  3. Instead, if it matches this other pattern, do this other thing
  4. ...and so on...

This sounds a lot like the more traditional switch, and they are related. However, pattern matching gives you some extra functionality, namely:

It may be worth looking into a high-level language with pattern matching, if you're not already familiar with one. If you're used to Java, I'd personally recommend looking at Scala, which runs on the JVM, interoperates easily with Java, and tends not to get in your way when you're first learning it. You may also be interested in OCaml, especially if you're interested in using LLVM bitcode as a target language (LLVM has native OCaml bindings).

All that said, diving into a new language is itself a big risk, so you may be better off suffering through some unclean code. If you're using an object-oriented language for implementation, you may be interested in the visitor design pattern, which can handle certain features of pattern matching (specifically, making sure all cases are covered). However, from my personal experience, the visitor design pattern is usually more trouble than it's worth, and I usually will do something like switch and will potentially cast objects frequently. This isn't considered good practice, but the fundamental problem is in the implementation language itself fighting you unnecessarily.

Grading (and Testing your Code)

Given that every project is different (perhaps radically so), grading becomes difficult. As such, I'm effectively offloading much of this responsibility onto you. It is your responsibility to demonstrate to me that your code works. The most effective way of doing this is via unit tests (to test individual components), and integration tests to make sure components work together correctly. If your tests test a wide variety of circumstances, this is good. It's even better if your tests have high code coverage (measured with a code coverage tool). Code without tests is unlikely to be evaluated well, especially if the code is difficult to follow.

Working in Teams

Each project is expected to be a massive undertaking, and the difficulty level has been designed with the assumption that each project will be backed by 2-4 people. To be clear, each project will require you to do the following:

None of this is expected to be easy. From my own recent experience, it took me eight weeks to solo implement a naive, non-optimizing compiler from Prolog to C. From another project, I already had a lexer, parser, and typechecker available. Matching my own development timeline to the course timeline, this development wasn't fast enough; I would have missed deadlines. The point: you almost assurredly cannot do this alone.

Working with teams is a scary prospect, given the classic problem of unbalanced workloads between team members. Using the team-oriented nature of this class to coast along on someone else's effort is not acceptable. To make sure people aren't coasting, I will be doing two things for each deadline:

  1. On each deadline, you will submit an evaluation of yourself and of your team members. If you think someone is coasting, say so.
  2. We will be using GitHub for all code development, and I will measure the number of code contributions (lines added/edited/removed) that each team member has performed across all branches. If I see that someone's contribution count is significantly lower or higher than everyone else's, and no explanation is in the peer evaluations, I'll investigate. There are certain benign situations where this can happen; e.g., someone starts reading a lot of documentation on a target language, and diverts time away from coding with the team's consent. If you have far less code without explanation, you, and only you, will not receive credit for the deadline. Conversely, if you have far more code without explanation, I will contact you to ask what's going on.

    The reason for doing both peer evaluations and GitHub monitoring is to avoid situations where someone feels pressured to give someone a good peer evaluation, and similarly to avoid situations where it's one person's word against another's. Peer evaluations are easy to spoof, but code contributions are not.

Project Proposal

Now that you're familiar with the ground rules for the proposal, along with related caveats, the information about the proposal itself follows. The template you should use is here. Two example proposals have been provided, listed below:

  1. The pOOP langauge, an exploration of how class-based object-oriented programming languages can be compiled down to languages without object-oriented features. The main difficulty here is in code generation.
  2. The ScalellScript language, an exploration of how typechecking works in modern functional languages. The main difficulty here is in typechecking, with (likely) relatively straightforward code generation.

These examples are provided to give you an idea of what you can do. If you really want to use one of these instead of developing your own, talk to me first.

Language Documentation

You must provide documentation for your language. You have total control over the format of this documentation; for example, you could create a website for it with links to different parts (GitHub has free web hosting and makes this relatively easy), or create a more traditional document. No matter the format you use, your documentation must cover the following:

Your documentation can include more information than the above if you so choose, but it must cover the above elements.

In terms of the size, there are no hard minimum or maximum sizes. That said, I would expect that you'd need at least two pages to cover everything, and that you're probably going into too much detail if you need more than ~20 pages.

Internally, I will use this documentation to assist in the grading process. The documentation is your opportunity to guide me through what you've done, and to point out both the good and the bad. It's easy to put documentation off and to do a subpar job of it, but keep in mind that the documentation is worth a full 9% of your grade. This is more than the tokenizer and parser combined! I intentionally have this scored this way to incentivize you to spend time on the documentation. If you'd like early feedback on the documentation, just let me know and I'm happy to look at early drafts without grading anything.

Presentation

Your group must present your language during either the last day of lecture (Thursday, May 9) or the final exam slot (Thursday, May 16 from 12:45 PM - 2:45 PM in the usual room (JD 2216)). This presentation overlaps with some of your documentation. This presentation must cover the following:

Your presentation may cover additional content, if you wish.

As for the format of your presentation, the only hard constraint is that you will be given 12 minutes max, followed by ~5 minutes of Q&A. Some related details follow:

The overall purpose of the presentation is to show off what you've done to the class, and to give an opportunity for us to ask questions directly before final grading.