C0 to Clac: Targeting a Stack-based Virtual Machine

Aaron Gutierrez & Shyam Raghavan

15-411 Final Project

Fall 2015


Introduction

As current 15-122 TAs, we felt as though an interesting and difficult project would be to retarget the L4 compiler to a minimal adaptation of Clac. Through the implementation of this retargeted compiler, we hoped to learn the following: how to retarget a register-based language to a stack-based virtual machine language; how to minimize the instruction set needed for a normally very large register-based instruction set to a very small stack-based instruction set; and how to develop conventions for a very small language that mimic behaviors of a very large language.

In order to learn more about Clac and the process of compiling to a stack-based language, we researched information about Forth and the development philosophy behind Forth. This helped us determine what modifications were available to make Clac more targetable.

We also researched methods for considering the optimizations of stack-based virtual machines and for register allocation on the stack. While we decided that we would not focus on optimizations and would instead ensure correctness, we did implement a simple register allocator to better utilize the operand stack.

Retargeting the register-based x86-64 assembly language into a stack-based virtual machine language allowed us to delve into the specifics of the implementation of compilers that target stack-based virtual machines (like the Java Virtual Machine). One of the traditional advantages of a stack-based virtual machine language is that the instruction set (and therefore the number of bytes needed to represent each instruction) is very small - a disadvantage (that we clearly rediscovered) is that because the instructions are less specific, more instructions (and the total number of bytes needed to represent them) are needed to achieve the same functionality. Finally, we also needed to develop conventions on function calling, function flow, and memory - because we were retargeting from x86-64 assembly, most of the conventions we developed were inspired by it.

The development of this retargeted compiler, while not immediately useful, helped us learn skills necessary for engineering software for a rapidly-moving, ever-changing target, required us to be very careful in the engineering and the maintenance of conventions for a language and virtual machine, and forced us to consider tradeoffs between performance and correctness with the short amount of time and resources we had to complete the compiler.

Project

For this project, we chose to retarget C0 code to a minimal adaptation of the interpreted language based off of Clac used in 15-122 Principles of Imperative Computation called Clac.

In order to accomplish this, we must minimally alter the Clac runtime to allow for heap-like memory access and allocation. We must also alter the compiler to produce a stack-based internal representation and a back-end Clac-based representation. In sum, this means augmenting the internal representation and the back-end of the compiler.

We aimed to keep the clac language as close to the original as possible. The majority of the work in this project is derived from the difficulty in adapting an internal representation designed to target x86-64 assembly to output clac. While we could have improved performance and optimizations, we chose to instead focus on correctness and developed with the explicit belief that correctness was paramount over performance given our time constraints and the resources available.

Implementation

Modifications to the Claculator

The Clac language needed to be extended in order to successfully compile all (or at least a significant portion) of the features needed by the L4 language. We started with a working implementation of the Claculator from the fall 2015 version of the 15-122 assignment, which has user-defined tokens. We were able to implement almost all of the control flow needed using just user-defined tokens, with one exception, noted below.

Our first modification adds a mutable, global array of values to complement the operand stack. To use the memory array, the following instructions were added:

get:
n get will take the value stored at index n and place it at the top of the operand stack
put:
v n put will place the value v at index n in the memory array
alloc:
n alloc allocates a contiguous block of memory for n values and places the index of the first location on the top of the operand stack

These instructions were designed to be in the spirit of the original Clac language. The get and put tokens are similar in behavior to the ! and @ words, respectively, in the Forth language upon which Clac is based. The alloc token serves an analogous purpose to the same keyword in L4.

Additionally, the remaining comparison operators, <=, >=, =, != were added. These behave as expected. Bitwise operators were added to support bitwise arithmetic. Bitwise and (&), or (|), and exclusive or (^), along with arithmetic shifts (<< and >>), behave as they do in C0 and L4.

A couple other modifications were made to support programs that do not return a value. The abort token causes the Claculator to raise a SIGABRT signal, used for assertions. c0_mem_error raises a SIGUSR2 signal, used to indicate memory errors in the source program, rather than the Claculator.

With the above extensions, we are able to compile almost all programs in the L4 language, with one large exception. Clac’s only control flow tokens are if and skip. In order to handle programs that contain loops, the loop guard and loop body become separate functions. As a result, without further modification, we could not compile programs that return inside the body of a loop.

One of the features of the calculator that we aimed to preserve is the simplicity of the interpreter loop. Though the introduction of user-defined tokens and the call stack complicates things slightly, for the most part it consists of a queue where each token is considered exactly once. If we were to add backwards skips, the interpreter would become significantly more involved. So in order to support returning from the body of a loop, we made hack that aims to leave the interpreter loop simple.

To overcome that shortcoming we introduced a special return token. For most functions, return causes the rest of the instruction queue to be skipped, and the stack is left unmodified. But for functions whose names start with ‘.’ return causes the current function and the calling function to return. If the calling function’s name also starts with ‘.’, it’s calling function also returns recursively. We are therefore allowed to use functions like basic blocks and keep the interpreter loop relatively simple.

In order to better match the behavior of a C0 program, the Claculator was modified to not print a prompt or the operand stack when reading from a file, and to return the top value on the stack, if it exists, when exiting.

Modifications to Our Compiler

Because Clac programs are quite significantly different than x86-64 programs, we had to add a completely new back end to our compiler. We are able to use the same intermediate tree representation for both the x86-64 and the Clac back end. From that point, we add two new steps to generate Clac code.

Our first translation targets a high-level stack-based form that contains variables, branches, and unconditional jumps. This serves mostly as a translation from infix to reverse polish form. An example program in this form can be seen in Figure 1.

Figure 1: Example program after different stages
int foo(int a) {
  int b = 2 * a;
  if (a < 0) {
    b = 0;
  }

  return b;
}

int main() {
  return foo(5);
}
(a) L4 Source
foo(a) {
  %t0Q  <--  arg
  %t1Q  <--  (2 * %t0Q)
  if (((%t0Q < 0) == 1))
  goto .3
  goto .4
  .3
  %t1Q  <--  0
  goto .5
  .4
  .5
  return %t1Q
}

main() {
  return foo(5)
}
(b) Intermediate Tree
foo:
  storevar 0
  push 2
  loadvar 0
  mul
  storevar 1
  loadvar 0
  push 0
  <
  push 1
  ==
  if {
    push 0
    storevar 1
  } else {
     
  }
  loadvar 1
  return

main:
  push 5
  foo
  return
(c) High-level Stack Form

From this high-level language, then translate to Clac. Because Clac doesn’t specify how the memory array should be used, we emulate the x86_64 stack discipline. All function arguments are passed on the stack, and temporary variables are assigned from the highest-index down. Our Clac implementation creates a 50,000 element memory array, so we start from index 49,999. To simulate stack frames, we use memory address 0 as a pseudo-stack pointer. For example, the nth local variable for a function is access by:

0 get n + get

The value at address 0 is decremented the appropriate amount at the start of each function, and increment before every instance of return.

Conditional branches like

… exp if stms else stms …

are expanded to

… exp if .1 1 skip .2 …; : .1 stms ; : .2 stms ;

Loops such as

… .1 exp if stms goto .1 else …

become

… .1 …; : .1 exp if .2 1 skip .3 ; : .2 stms .1 ; : .3 ;

Translation of arithmetic and stack manipulation expressions are mostly trivial. The final output of our sample program can be seen in Figure 2.

: _c0_foo 0 get 3 - 0 put 0 get 0 + put 2 0 get 0 + get * 0 get 1 + put 0 get 0 + get 0 < 1 = if .6 1 skip .7 0 get 1 + get 0 get 3 + 0 put return ;
: .6 0 0 get 1 + put ;
: .7 ;
: _c0_main 0 get 0 - 0 put 5 _c0_foo 0 get 0 + 0 put return ;

49999 0 put _c0_main
Figure 2: Example program compiled to Clac

At the end of our Clac source file, we see our minimal runtime, which sets the stack pointer to the largest index, and calls the function _c0_main.

Memory operations were very straight forward to implement. Unlike when implementing L4 for x86-64, pointers are the same size as integers, because pointers are just indexes into the memory array. Dereferencing a pointer p is simply p get get, and computing array and struct offsets are greatly simplified by the fact that all small types have size 1.

In order to support array bound checking, the size of the array needs to be stored in memory. We employ the same strategy as before, storing the size at the index before the first array element. Bounds checks and null checks are implemented exactly as before, adding an if statement before all accesses and conditionally evaluating the new c0_mem_error token in Clac if the access is invalid. NULL is represented by the value 0.

Register Allocation

In the most straightforward translation, the operand stack is rarely employed. To increase utilization of the stack, we implemented Koopman’s algorithm for register allocation on stack machines. The algorithm as implemented is described below.

Koopman’s Algorithm, as implemented

  1. Compute the set of temps that is used at some point after each line within basic blocks. This analysis is greatly simplified as the only Clac instruction that uses a value is get.

  2. For each temp that is used within a block, when it is first written to, additionally push a copy of the value onto the top of the operand stack.

  3. For subsequent reads, replace the get instruction with a pick instruction that moves the corresponding value to the top of the stack

For simple programs, our register allocator works correctly. However, our implementation is not correct for all inputs, and time did not permit more effort to correct our implementation. As such, register allocation is disabled by default, but can be enabled with the –clac-alloc flag.

Because of the nature of the stack, values are not always in the same location relative to the top. As such, care must be taken to keep track of where values are located.

Figure 3: Simple program with and without register allocation
int main() {
  int t0 = 15;
  int t1 = 411;

  return t0 * 1000 + t1;
}
(a) L4 Source
0 get 3 - 0 put
15 0 get 0 + put
411 0 get 1 + put
0 get 0 + get
1000
*
0 get 1 + get
+
0 get 3 + 0 put
return
(b) Main method without register allocation
0 get 3 - 0 put
15 1 pick 0 get 0 + put
411 1 pick 0 get 1 + put
2 pick
1000
*
2 pick
+
0 get 3 + 0 put
return
(c) Main method with register allocation

In Figure 3, we see a sample program with and without register allocation. Notice that we still place values in the memory array even though they are no longer used. An implementation that recognizes dead put instructions is outside the scope of our project, but would be interesting to study further.

Testing Methodology

To test our implementation, we reused most of the tests from Lab 5. Any test that used library functions were deleted, as we did not port the full runtime to Clac, but all other features of L4 were implemented.

To run tests against our implementation, we modified the existing testing framework to pass the necessary flag for our compiler to target Clac. To run the code and produce output, it then calls our modified Claculator with the compiled Clac code. The Claculator is linked with a library that raises the correct signals to test for memory errors, SIGABRT is generated using the C0 assert function, and otherwise the value returned from the main method is return by the main method of the Claculator. Thus, we should see exactly the same effects from running our Clac code using the Claculator and the same code compiled natively for x86-64. As such, the testing framework took very little modification.

For speed, the Claculator is compiled with the unsafe C0 runtime using the -O2 flag for gcc. Even so, execution time is significantly higher than running the native executable. For instance, the gcd benchmark runs around 500 times slower than our native compiler output with all optimizations disabled. We could have chosen to increase the timeouts for testing, but chose not to so that the grading time was still sane.

Analysis

Quantitative Analysis

We explicitly made the decision to not focus on optimizations or performance - this decision was made with the reasoning that (a) the virtual machine implementation of Clac has no way of being possibly as fast as the hardware instruction set given by x86-64 assembly, precisely because the virtual machine implementation of Clac is compiled into x86-64 assembly, (b) stack-based machines are empirically slower than register-based machines because of the larger number of instructions, and (c) with the limited amount of time and resources we had available, we felt as though we needed to focus on either correctness or performance, and chose correctness to be paramount.

While we did make this decision, it is important to note that, as mentioned earlier, the gcd benchmark runs about 500 times slower on the Clac virtual machine than on the native compiler output with all optimizations disabled. In addition, running on a system with two quad-core Intel X5560 CPUs running at 2.8Ghz and the default timeouts in the autograder, 208 tests timed out, as opposed to the regression tests, in which 72 tests timed out.

Again, because this was not the focus of this project, we chose to disregard performance issues and focus on improving correctness throughout our time working on the compiler. In the future, to improve the performance of our Clac-targeted compiler, we could improve the stack-frame addressing system - rather than executing 5 instructions (0 get i + get) to get the ith stack-frame variable, we could either alter the instruction set to include a stack pointer or improve the ability of the compiler to decide what needs to be stored on the operand stack and/or on the stack frame. We could also reduce the amount of dead code created by the compiler - while function arguments are evaluated from left to right in C0, the way we were able to implement this was by evaluating them from left to right, dropping them off the operand stack, and then evaluating them from right to left in preparation of calling the function (because the calling convention developed expected the arguments in left-to-right order). In order to reduce this dead code, we could either alter the calling convention to expect the arguments in right-to-left order (and then, instead of dropping arguments upon left-to-right evaluation, we leave it on the operand stack) or alter the specification that arguments are evaluated in left-to-right order.

Regardless of the improvements we can make to the Clac runtime or the compiled code, we know that we will never be able to improve the performance of the Clac virtual machine to be faster than that of x86-64 assembly. Because the only test cases we fail are those that particularly stress the compiler or the runtime, we feel as though the performance is robust enough for demonstration to curious 15-122 students that wonder why they’re asked to add function definitions to a language like Clac.

On that point, with the single addition of the memory array, a large number of C0 programs can be successfully compiled to clac. Provided that all functions only return at the end, the addition of return to clac is avoided. With under 20 simple lines of additional code, a 15-122 student could have a way to run non-trivial programs in their own clac implementation for testing and exploration purposes.

Qualitative Analysis

In terms of qualitative analysis, it appears that Clac code is in fact (although precedent tells us that this is not the case) more concise than x86-64 assembly. This is probably because of the discrepancy between Clac being a virtual machine language and x86-64 assembly being a hardware instruction set.

The hardest areas to implement involved the design and fleshing out of the conventions having to do with control flow and calling functions. The difficulties presented themselves because there were no existing conventions, and many possibilities either allowed for ambiguities or did not ascribe to C0 conventions. For example, while some solutions for things like if-else control flow did ascribe to C0 semantics, they left room for ambiguities in the manner in which code that returned from a branch was supposed to execute. This particular problem was resolved with the introduction of a return token, but decisions like these forced us to carefully develop a set of guidelines and rules to follow when designing and writing the compiler. While making these convention decisions were not easy, the decisions that were made allowed the runtime for Clac to be (a) minimally altered and (b) very easy to invoke.

Conclusion

In conclusion, while the retargeting of a C0 compiler to Clac virtual machine code is not the most practically useful, we’ve learned quite a few skills that will have very practical applications both in academia and in the software industry.

By completing this compiler, we were able to develop our own calling convention and control flow convention. We also developed a code generation step from an intermediate tree representation to a stack-based intermediate representation. All of these tasks required us to be very deliberate about documentation, to consider tradeoffs between performance and correctness with a limited amount of resources, and develop a piece of software for a rapidly-changing target (like from x86-64 assembly to Clac virtual machine code).

Finally, it has been an amazing experience and a really fun one building the largest single piece of code we’ve seen from start to finish. We’ve enjoyed the class immensely, and we’d like to thank Rob and the course staff for making it such an enjoyable experience.