=head1 Stack To Register Translation
The .NET CLI is a stack based virtual machine whereas Parrot is register based.
This document details what that means and how the challenges the difference
presents will be dealt with.
=head2 Parrot's Register Architecture
In general, register machines have a set of registers, perhaps numbered from
r0 to r15. These registers are used to store data that is currently being
operated on. Instructions load data from memory into registers, save data
from registers into memory, move the contents of one register into another
and perform arithmetic and logical operations. The operands for these are
registers.
load r0, SOME_ADDRESS
load r1, SOME_OTHER_ADDRESS
add r2, r0, r1
store r2, YET_ANOTHER_ADDRESS
Here two values are being loaded from memory into registers r0 and r1. The
add operation is adding the contents of registers r0 and r1 and storing it
in r2. Finally, the contents of r2 is being stored in memory.
Just as some hardware CPUs have different types of registers (often one set for
integers and another for floating point numbers), Parrot has 4 register types.
Integer (I) registers hold native integers. Number (N) registers hold native
floating point numbers, usually an IEEE 756 Double on platforms where this is
supported. String (S) registers hold references to strings. PMC (P) registers
hold references to PMCs (basically, any more complex type).
In PASM (Parrot Assembly Language), registers are denoted I0, I1, I2, etc. The
numbers index into the array of registers of the given type.
set S0, "Hello world!\n"
print S0
end
In PIR (Parrot Intermediate Representation), virtual and named registers are
introduced. Virtual registers are denoted as $I0, $I1, $I2, etc.
.sub main :main
set $S0, "Hello world!\n"
print $S0
end
.end
Named registers are introduced using the ".local" syntax.
.sub main :main
.local string hello
set hello, "Hello world!\n"
print hello
end
.end
PIR is a thin layer on top of PASM. The virtual registers and named registers
get mapped to real Parrot registers, at the time of writing using a graph
coloring algorithm.
Unlike hardware CPUs, where a register is a physical area of storage and there
are a fixed number of these, Parrot's registers are simply an array in memory.
Initially Parrot mirrored this and provided a fixed number (32) of registers of
each type. If PIR programs needed more registers than were available, P31 was
reserved to hold a spilling array or stack. Later it was realized that there
was no reason not to have variable sized register frames (a register frame
being the set of registers corresponding to the currently executed unit of code
e.g. a subroutine, closure or coroutine). This eliminates the expense of
spilling and simplified register allocation greatly. Also, it means that less
memory is consumed by register frames for small functions.
This translator emits PIR. This means that it can freely use virtual registers
as needed and rely on the PIR compiler to minimize the number of real registers
that are needed. This still leaves a lot of work to do in converting stack code
into good register code, but simplifies the problem by taking care of register
allocation and freeing the translator from having to do that.
=head2 The .NET CLI's Stack Architecture
In general, stack machines simply consist of a stack - an array of elements of
memory with a pointer to the "top" one. Items can be pushed onto or popped off
the top of the stack and instructions generally operate on the items on the
top of the stack, popping their operands of the stack and pushing their result
onto the stack. The following example computes (A*B + C), leaving the result
on the stack.
push A # Stack contains 1 item: A
push B # Stack contains 2 items: A and B
push C # Stack contains 3 items: A, B and C
mul # Stack contains 2 items: A and B*C
add # Stack contains 1 item: A*B + C
In practice the push instructions would actually be instructions to load a
local variable or argument of the current method, a field of the current
object/class or some other global data. Instructions exist to store data on
the top of the stack back to these locations once computation has taken place.
The stack locations as used in the above example are "untyped". In the above
example A, B and C may have been integers. This code could be followed by
identical code apart from using D, E and F which could be floating point
numbers, for example. Stack positions that previous held integers are now
used to hold floating point numbers. Data flow analysis can be used to
find the types of items currently on the stack as needed. In a sense, the
locations are transitionally typed.
The .NET CLI, while supporting a number of data types, both different sizes
and signings of integers as well as floating point numbers and more complex
reference types, only has four possible data types that may be stored on the
stack itself. These are 32-bit and 64-bit integers, native integers and a
floating point data type. Smaller integer types are widened and narrowed as
required on load and store.
The .NET CLI places an additional constraint of the instruction stream. The
specification, in Partition I Clause 12.3.2.1, states that "the type state of
the stack (the stack depth and types of each element on the stack) at any
given point in a program shall be identical for all possible control flow
paths". Therefore, a program like the following one would be forbidden.
if A # if operation skips the next instruction if A is false
push X
...
if A
pop X
In the region of instructions marked "..." the stack depth is unknown; it
depends on whether A is true or false. The two control flow paths that are
possible upon reaching "..." do not result in identical stack type states.
The property that makes code like this invalid means that the stack pointer
can always be determined statically; that is, at translation time.
=head2 Data On The Stack
There are essentially 3 types of item found on the .NET CLI's execution stack
that need to be mapped to Parrot registers.
=head3 Arguments
This is the data that has been passed to a method. In a stack architecture, the
calling convention is that parameters for the method are pushed onto the stack
before the method is called. They are accessed by a number of .NET instructions
that simply access the part of the stack where the currently executing method's
arguments are located. These intructions load the arguments onto the top of the
stack so they can be manipulated. The space they consume on the stack is
retained for the duration of the method call.
=head3 Locals
These are local variables that will almost always relate to variables declared
in the high level language source code. When the method is entered, space for
these is allocated on the stack. Just as with arguments, some intructions
exist for copying values from the area of the stack holding the locals onto
the top of the stack and vice versa. The space they consume on the stack is
retained for the duration of the method call.
=head3 Temporaries
These usually do not correspond to variables used in the high level language
code. Rather, they are introduced by virtue of the fact that the execution
model can only do a number of simple operations and therefore temporary
storage of working data is needed outside of local variables. A stack
machine has even greater need for this, for all operands need to be loaded
onto the stack. Any data on the stack that is not an argument or a local
can be considered a temporary.
=head2 Solutions For Allocating Arguments
Arguments consume a number of fixed stack locations. The calling conventions
for stack machines involve pushing the parameters for the call onto the
stack. Parrot's calling conventions are register based. However, PIR syntax
abstracts away the details of the calling conventions. Here is a factorial
function in PIR.
.sub fact
# Get input parameter.
.param int n
# return (n > 1 ? n * _fact(n - 1) : 1)
.local int result
if n > 1 goto recurse
result = 1
goto return
recurse:
$I0 = n - 1
result = fact($I0)
result *= n
return:
.return (result)
.end
Calling a function is simply written as it might be in a high level language -
the function name with arguments in parentheses. Parameters are retrieved by
the ".param" directive. Each ".param" is stored in a named virtual register.
.NET CLI code always retrieves parameters by number, the argument number being
a constant and thus known at translation time. Therefore, a naming scheme that
has the first parameter named "arg0", the second parameter named "arg1" etc.
will be sufficient.
=head2 Solutions For Allocating Locals
PIR provides a ".local" directive for declaring a local in a named register.
.local int name
.NET CLI code retrieves parameters by number, the number being a constant
and thus known at translation time. Therefore, a naming scheme that has
the first local named "local0", the second local named "local1" etc. will be
sufficient.
=head2 Solutions For Allocating Temporaries
Temporaries make up the part of the stack that changes in depth and type
during the execution of the method. Therefore the mapping is more complex.
This section investigates several options.
=head3 Having A Stack
One way to translate stack code to register code is to actually have a stack.
In the Parrot case, that could be a PMC array (which can also have strings and
boxed integers and numbers on it); the PMC array in Parrot supports push and
pop operations already. As the maximum stack depth is known in .NET (it is
stated in the method metadata) a fixed size PMC array can be allocated.
Translation of loads from arguments or locals can simply be a push - a one to
one instruction mapping.
push stack, arg1
push stack, local2
Translation of loads from fields involves two instructions; one to do the field
access and store the instruction in a register, and another to push the contents
of that register onto the stack. Store operations will have the same costs as
load operations depending on the storage location. At worst, a load or store
would translate to two instructions, with one in the common case.
While this sounds reasonable so far, the real weakness of this approach is
made clear when operations are being performed. For example, an addition
instruction under this scheme becomes 4 register operations.
pop stack, $I0
pop stack, $I1
add $I0, $I1
push stack, $I0
Maintaining a stack has overhead too; under the hood it translates to doing
v-table calls to the push and pop method of an array PMC. This will adversely
affect performance beyond simply having more operations to dispatch (or JIT).
The whole sequence of pushing a local and an arguemnt onto the stack, adding
them together and storing the result in another local variable (4 instructions
in stack code) can actually become a single register instruction.
add local3, arg1, local2
Where local3 is the destination for the result and arg1 and local2 are the
operands for the add instruction.
=head3 Lazy Stack Operations
One partial solution is to do pushes and pops lazily. Lazy pushing can be
implemented by keeping track of things that would be on the stack, but instead
no push instruction is generated. It is important to note that this tracking is
done at translation time, not when the code runs. On seeing a push in the .NET
stack code, the thing that would have been pushed, if it is a local or argument,
can have its register name stored in an array. Then, when an instruction such
as add is encountered, this array is inspected and instead of generating any
pop instructions, the names of the registers holding the source data are
substitued directly into the instruction. This eliminates 4 instructions - a
vast improvement. What about the store after the add? This will require peeking
ahead to the next instruction to see if it is a store to a local, and if it is
then replacing the destination register with the appropriate register, and
then not generating any code for that instruction. Here is an example where
some stack code to multiply two numbers passed as arguments using a loop
addition is translated.
STACK CODE LAZY STACK REGISTER CODE
LOOP: LOOP:
ldlocal 1 local1
ldarg 1 local1, arg1
if_ge END if local1 >= arg1 goto END
ldlocal 2 local2
ldarg 2 local2, arg2
add
stlocal 2 --> LOOKAHEAD add local2, local2, arg2
ldlocal 1 local1
iconst 1 local1, 1
add
stlocal 1 --> LOOKAHEAD add local1, local1, 1
br LOOP goto LOOP
END: END:
ldlocal 2 local2
ret .return(local2)
In this example, the register code produced is very good - as good as one could
produce by hand. However, it still requires a stack when a result is used by
a non-store instruction.
STACK CODE LAZY STACK REGISTER CODE
ldarg 1 arg1
ldarg 2 arg1, arg2
ldarg 3 arg3
add arg1 add $I0, arg2, arg3
push stack, $I0
mul arg1 pop stack, $I0
mul local1, arg1, $I0
stlocal 1 --> LOOKAHEAD
In addition to this, without analysing the basic blocks[2] that make up the
code to determine when it can be avoided, it is necesary to push all cached
items onto the stack before a branch or a conditional branch takes place or
a label is found. While this and a full set of optimizations to minimize
stack use are possible, this path will not be explored. Given that there are
essentially unlimited registers available to use, it would be desirable to
completely do away with having a stack. Doing stack operations costs v-table
calls, which harms performance and makes poor use of the Parrot execution
model.
=head3 Mapping Stack Locations To Registers
An early paper[1] on translating Java bytecode to native code suggested that,
in cases where the stack depth could be statically determined, each stack
location could be mapped to a register. This way, pushes and pops simply became
move instructions and a stack is no longer required. The condition that the
stack depth can always be determined statically holds in .NET CLI code, though
they do provide an algorithm for checking that this is the case in the paper.
This approach means that each stack instruction becomes one register
instruction, provided there is a single register instruction with equivalent
semantics to the stack instruction (that is, some .NET instructions may map
to multiple Parrot instructions independent of the stack to register
translation method used).
The following example demonstrates the translation of some stack code to
equivalent register code for a trivial example. Note that the stack pointer
is internal to the translator and simply denotes which virtual registers
hold the operands. Note also that the "set" instruction is how Parrot spells
"move".
STACK CODE REGISTER CODE STACK POINTER (BEFORE/AFTER)
LOOP: LOOP: 0 / 0
ldlocal 1 set $I0, local1 0 / 1
ldarg 1 set $I1, arg1 1 / 2
if_ge END if $I0 >= $I1 goto END 2 / 0
ldlocal 2 set $I0, local2 0 / 1
ldarg 2 set $I0, arg2 1 / 2
add add $I0, $I0, $I1 2 / 1
stlocal 2 set local1, $I0 1 / 0
ldlocal 1 set $I0, local1 0 / 1
iconst 1 set $I1, 1 1 / 2
add add $I0, $I0, $I1 2 / 1
stlocal 1 set local1, $I0 1 / 0
br LOOP goto LOOP 0 / 0
END: END: 0 / 0
ldlocal 2 set $I0, local2 0 / 1
ret .return($I0) 1 / 0
Note that this is a massive simplification; it is only possible to translate
instructions in the order they appear if there is always nothing on the stack
whenever a branch takes place. In general this is not true, and must be
resolved by identifying basic blocks and propogating the stack position at
a jump to the block along with the block, so when translation of that basic
block takes place the translator knows what the stack pointer is.
The code that has been produced here is far from optimal, however it only uses
registers, not a stack, which is a move in the right direction. Also, it is
now possible for a register code optimizer to improve it.
=head3 Clearing Redundant Moves
A paper entitled "The Case For Virtual Register Machines"[3] published in 2000,
along with the earlier paper[1], acknowledge that many of the move instructions
generated when mapping the stack onto some registers are redundant. The 2000
paper explains how a copy propogation algorithm was used that "re-wires the
source and destination registers of instructions to bypass moves". This can be
done fairly easily by converting the code into SSA form. It is possible that
the Parrot PIR optimizer could catch many of these cases.
PIR is textual in nature, and the less of it that needs to be parsed and
compiled, the better. While copy propogation as a post-translation optimization
could be implemented within the translator, it would likely require an
intermediate form for the translated code to avoid a lot of expensive text
manipulation operations. It would be desirable to just never generate the
moves that would be removed, just as instructions were never generated when
this was clearly avoidable in the lazy pushes/pops algorithm presented earlier.
=head3 Never Generating Redundant Moves
This method essentially does "lazy moves" when loading data from a register.
This covers loads of locals and arguments, which make a high proportioin of
the loads that are performed. It takes advantage of constraints on the .NET
instruction stream to do this in a single pass.
XXX TO DO
=head2 References
B<[1] Java Bytecode to Native Code Translation: The Caffeine Prototype And
Preliminary Results>
L<http://portal.acm.org/ft_gateway.cfm?id=243864&type=pdf>
Performance issues with imitating the stack, suggestion of mapping stack
locations to registers and additional problems with that.
B<[2] Modern Compiler Design>
Book
Description of basic blocks.
B<[3] The Case For Virtual Register Machines>
syntax highlighted by Code2HTML, v. 0.9.1