=head1 NAME

pirgrammar.pod - The Grammar of languages/PIR

=head1 DESCRIPTION

This document provides a more readable grammar of languages/PIR. The actual
specification for PIR is a bit more complex. This grammar for humans does not
contain error handling and other issues unimportant for this PIR reference.

=head1 STATUS

For bugs and issues, see the section KNOWN ISSUES AND BUGS.

The grammar includes some constructs that B<are> in the IMCC parser, but are
not implemented.

Please note that languages/PIR is B<not> the official definition of the PIR
language. The reference implementation of PIR is IMCC, located in C. However,
languages/PIR tries to be as close to IMCC as possible. IMCC's grammar could
use some cleaning up; languages/PIR might be a basis to start with a clean
reimplementation of PIR in C (using Lex/Yacc).

=head1 VERSION

0.2.0

=head1 LEXICAL CONVENTIONS

=head2 PIR Directives

PIR has a number of directives. All directives start with a dot. Macro
identifiers (when using a macro, on expansion) also start with a dot (see
below). Therefore, it is important not to use any of the PIR directives as a
macro identifier. The PIR directives are:

  .arg            .invocant          .begin_call
  .const          .lex               .call
  .line           .end_return        .end_call
  .loadlib        .end_yield         .endnamespace
  .local          .end               .meth_call
  .pragma         .get_results       .namespace
  .return         .globalconst       .nci_call
  .result         .HLL_map           .param
  .sub            .HLL               .begin_return
  .yield          .include           .begin_yield

=head2 Registers

PIR has two types of registers: real registers and symbolic or temporary (or
I if you like) registers. Real registers are actual registers in the Parrot VM.
The symbolic, or temporary, registers are mapped to those actual registers.
Real registers are written like:

  [S|N|I|P]n, where n is a non-negative integer.

Symbolic registers, on the other hand, have a B<$> prefix, like this: C<$P10>.

Symbolic registers can be thought of as local variable identifiers that don't
need a declaration. This saves you from writing C<.local> directives if you're
in a hurry.
Of course, it would make the code more self-documenting if C<.local>s were
used.

=head2 Constants

An integer constant is a string of one or more digits.
Examples: C<0>, C<42>.

A floating-point constant is a string of one or more digits, followed by a dot
and one or more digits. Examples: C<1.1>, C<42.567>.

A string constant is a single or double quoted series of characters.
Examples: C<'hello world'>, C<"Parrot">.

TODO: PMC constants.

=head2 Identifiers

An identifier starts with a character from [_a-zA-Z], followed by
zero or more characters from [_a-zA-Z0-9].

Examples: C, C, C<_foo>

=head2 Labels

A label is an identifier with a colon attached to it.

Examples: C

=head2 Macro identifiers

A macro identifier is an identifier prefixed with a dot. A macro identifier is
used when I<expanding> the macro (on usage), not in the macro definition.

Examples: C<.myMacro>

=head1 GRAMMAR RULES

=head2 Compilation Units

A PIR program consists of one or more compilation units. A compilation unit is
a global, sub, constant or macro definition, or a pragma or emit block. PIR is
a line-oriented language, which means that each statement ends in a newline
(indicated as "nl"). Moreover, compilation units are always separated by a
newline. Each of the different compilation units is discussed in this
document.

  program:
    compilation_unit [ nl compilation_unit ]*

  compilation_unit:
      global_def
    | sub_def
    | const_def
    | expansion
    | pragma

=head2 Subroutine definitions

  sub_def:
    ".sub" sub_id sub_pragma* nl body

  sub_id:
    identifier | string_constant

  sub_pragma:
      ":load"
    | ":init"
    | ":immediate"
    | ":postcomp"
    | ":main"
    | ":anon"
    | ":lex"
    | vtable_pragma
    | multi_pragma
    | outer_pragma

  vtable_pragma:
    ":vtable" parenthesized_string?

  parenthesized_string:
    "(" string_constant ")"

  multi_pragma:
    ":multi" "(" multi_types? ")"

  outer_pragma:
    ":outer" "(" sub_id ")"

  multi_types:
    multi_type [ "," multi_type ]*

  multi_type:
      type
    | "_"
    | keylist
    | identifier
    | string_constant

  body:
    param_decl*
    labeled_pir_instr*
    ".end"

  param_decl:
    ".param" [ [ string_constant "=>" ] type identifier ] [ get_flags | ":unique_reg" ]* nl

  get_flags:
    [ ":slurpy"
    | ":optional"
    | ":opt_flag"
    | named_flag
    ]+

  named_flag:
    ":named" parenthesized_string?

=head3 Examples subroutine

The simplest example for a subroutine definition looks like:

  .sub foo
  # PIR instructions go here
  .end

The body of the subroutine can contain PIR instructions. The subroutine can be
given one or more flags, indicating the sub should behave in a special way.
Below is a list of these flags and their meaning. The flag C<:unique_reg> is
discussed in the section defining local declarations.

=over 4

=item *

  :load

Run this subroutine during the B opcode.
B<:load> is ignored if another subroutine in that file is marked with
B<:main>. If multiple subs have the B<:load> pragma, the subs are run in
source code order.

=item *

  :init

Run the subroutine when the program is run directly (that is, not loaded as a
module). This is different from B<:load>, which runs a subroutine when a
library is being loaded. To get both behaviours, use B<:init :load>.

=item *

  :postcomp

Same as C<:immediate>, except that the sub is not executed when compilation
was triggered by a C instruction (in a different file).

=item *

  :immediate

This subroutine is executed immediately after being compiled. (Analogous to
C in perl5.)

=item *

  :main

Indicates that the sub being defined is the entry point of the program. It can
be compared to the main function in C.

=item *

  :method

Indicates the sub being defined is an instance method. The method belongs to
the class whose namespace is currently active. (So, to define a method for a
class 'Foo', the 'Foo' namespace should be currently active.) In the method
body, the object PMC can be referred to with C.


=item *

  :vtable or :vtable('x')

Indicates the sub being defined replaces a vtable entry. This flag can only be
used when defining a method.

=item *

  :multi(type [, type]*)

Engage in multiple dispatch with the listed types.

=item *

  :outer('bar')

Indicates the sub being defined is lexically nested within the subroutine
'bar'.

=item *

  :anon

Do not install this subroutine in the namespace. Allows the subroutine name to
be reused.

=item *

  :lex

Indicates the sub being defined needs to store lexical variables. This flag is
not necessary if the sub contains any lexical declarations (see below); the
PIR compiler will figure this out by itself. Otherwise, the C<:lex> flag is
needed to tell Parrot that the subroutine will store or find lexicals.

=back

The sub flags are listed after the sub name. The subroutine name can also be a
string instead of a bareword, as is shown in this example:

  .sub 'foo' :load :init :anon
  # PIR body
  .end

Parameter definitions have the following syntax:

  .sub main
    .param int argc :optional
    .param int has_argc :opt_flag
    .param num nParam
    .param pmc argv :slurpy
    .param string sParam :named('foo')
    # body
  .end

As shown, parameter definitions may take flags as well. These flags are listed
here:

=over 4

=item *

  :slurpy

The parameter should be of type C<pmc> and acts like a container that slurps
up all remaining arguments. Details can be found in PDD03 - Parrot Calling
Conventions.

=item *

  :named('x')

The parameter is known in the called sub by name C<'x'>. The C<:named> flag
can also be used B<without> an identifier, in combination with the C<:flat> or
C<:slurpy> flag, i.e. on a container holding several values:

  .param pmc args :slurpy :named

and

  .arg args :flat :named

=item *

  :optional

Indicates the parameter being defined is optional.

=item *

  :opt_flag

This flag can be given to a parameter defined I<after> an optional parameter.
During runtime, the parameter is automatically given a value, and is I<not>
passed by the caller.
The value of this parameter indicates whether the previous (optional)
parameter was present.

=back

The correct order of the parameters depends on the flags they have.

=head2 PIR instructions

  labeled_pir_instr:
    label? instr nl

  labeled_pasm_instr:
    label? pasm_instr nl

  instr:
    pir_instr | pasm_instr

NOTE: the rule 'pasm_instr' is not included in this reference grammar.
pasm_instr defines the syntax for pure PASM instructions.

  pir_instr:
      local_decl
    | lexical_decl
    | const_def
    | globalconst_def
    | conditional_stat
    | assignment_stat
    | return_stat
    | sub_invocation
    | macro_invocation
    | jump_stat
    | source_info

=head2 Local declarations

  local_decl:
    ".local" type local_id_list

  local_id_list:
    local_id [ "," local_id ]*

  local_id:
    identifier ":unique_reg"?

=head3 Examples local declarations

Local temporary variables can be declared with the C<.local> directive.

  .local int i
  .local num a, b, c

The optional C<:unique_reg> modifier will force the register allocator to
associate the identifier with a unique register for the duration of the
compilation unit.

  .local int j :unique_reg

=head2 Lexical declarations

  lexical_decl:
    ".lex" string_constant "," target

=head3 Example lexical declarations

The declaration

  .lex 'i', $P0

indicates that the value in $P0 is stored as a lexical variable, named by 'i'.
Once the above lexical declaration is written, and given the following
statement:

  $P1 = new 'Integer'

then the following two statements have an identical effect:

=over 4

=item *

  $P0 = $P1

=item *

  store_lex "i", $P1

=back

Likewise, these two statements also have an identical effect:

=over 4

=item *

  $P1 = $P0

=item *

  $P1 = find_lex "i"

=back

Instead of a register, one can also specify a local variable, like so:

  .local pmc p
  .lex 'i', p

The same is true when a parameter should be stored as a lexical:

  .param pmc p
  .lex 'i', p

So, now it is also clear why C<.lex 'i', p> is B<not> a declaration of p: it
needs a separate declaration, because it may either be a C<.local> or a
C<.param>.
The C<.lex> directive is merely a shortcut for saving and retrieving lexical
variables.

=head2 Constant definitions

  const_def:
    ".const" type identifier "=" constant_expr

=head3 Example constant definitions

  .const int answer = 42

defines an integer constant by name 'answer', giving it a value of 42. Note
that the constant type and the value type should match, i.e. you cannot assign
a floating point number to an integer constant. The PIR parser will check for
this.

=head2 Global constant definitions

  globalconst_def:
    ".globalconst" type identifier "=" constant_expr

=head3 Example global constant definitions

This directive is similar to C<.const>, except that once a global constant has
been defined, it is accessible from B<all> subroutines.

  .sub main :main
    .globalconst int answer = 42
    foo()
  .end

  .sub foo
    print answer # prints 42
  .end

=head2 Conditional statements

  conditional_stat:
      [ "if" | "unless" ]
      [ [ "null" target "goto" identifier ]
      | [ simple_expr [ relational_op simple_expr ]? ]
      ] "goto" identifier

=head3 Examples conditional statements

The syntax for C<if> and C<unless> statements is the same, except for the
keyword itself. Therefore the examples will use either.

  if null $P0 goto L1

Checks whether C<$P0> is null; if it is, flow of control jumps to label
C<L1>.

  unless $P0 goto L2
  unless x goto L2
  unless 1.1 goto L2

Unless $P0, x or 1.1 are 'true', flow of control jumps to L2. When the
argument is a PMC (like the first example), true-ness depends on the PMC
itself. For instance, in some languages, the number 0 is defined as 'true', in
others it is considered 'false' (like C).

  if x < y goto L1
  if y != z goto L1

are examples that check for the logical expression after C<if>. Any of the
I<relational> operators may be used here.

=head2 Branching statements

  jump_stat:
    "goto" identifier

=head3 Examples branching statements

  goto MyLabel

The program will continue running at label 'MyLabel:'.

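
Conditional and branching statements are typically combined to build loops.
The following is a minimal sketch, not taken from the IMCC sources; the sub
name C<main>, the variable C<i> and the labels C<LOOP> and C<DONE> are chosen
purely for illustration. Each label is written on the same line as an
instruction, as required by the C<labeled_pir_instr> rule above.

  .sub main :main
    .local int i
    i = 0
    # leave the loop as soon as i is no longer smaller than 3
    LOOP: unless i < 3 goto DONE
    print i
    print "\n"
    i += 1
    goto LOOP
    DONE: print "done\n"
  .end

Under these assumptions the sub should print 0, 1 and 2, each on its own line,
followed by "done".
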

=head2 Operators

  relational_op:
    "==" | "!=" | "<=" | "<" | ">=" | ">"

  binary_op:
      "+" | "-" | "/" | "**" | "*" | "%" | "<<" | ">>>"
    | ">>" | "&&" | "||" | "~~" | "|" | "&" | "~" | "."

  assign_op:
      "+=" | "-=" | "/=" | "%=" | "*=" | ".=" | "&=" | "|="
    | "~=" | "<<=" | ">>=" | ">>>="

  unary_op:
    "!" | "-" | "~"

=head2 Expressions

  expression:
      simple_expr
    | simple_expr binary_op simple_expr
    | unary_op simple_expr

  simple_expr:
      float_constant
    | int_constant
    | string_constant
    | target

=head3 Example expressions

  42
  42 + x
  1.1 / 0.1
  "hello" . "world"
  str1 . str2
  -100
  ~obj
  !isSomething

Arithmetic operators are only allowed on floating-point numbers and integer
values (or variables of that type). Likewise, string concatenation (".") is
only allowed on strings. These checks are B done by the PIR parser.

=head2 Assignments

  assignment_stat:
      target "=" short_sub_call
    | target "=" target keylist
    | target "=" expression
    | target "=" "new" string_constant
    | target "=" "new" keylist
    | target "=" "find_type" [ string_constant | string_reg | id ]
    | target "=" heredoc
    | target assign_op simple_expr
    | target keylist "=" simple_expr
    | result_var_list "=" short_sub_call

NOTE: the definition of assignment statements is B<not> complete yet. As
languages/PIR evolves, this will be completed.

  keylist:
    "[" keys "]"

  keys:
    key [ sep key ]*

  sep:
    "," | ";"

  key:
    simple_expr

  result_var_list:
    "(" result_vars? ")"

  result_vars:
    result_var [ "," result_var ]*

  result_var:
    target get_flags?

=head3 Examples assignment statements

  $I1 = 1 + 2
  $I1 += 1
  $P0 = foo()
  $I0 = $P0[1]
  $I0 = $P0[12.34]
  $I0 = $P0["Hello"]
  $P0 = new 42 # but this is really not very clear, better use identifiers

  $S0 = <<'HELLO'
  ...
  HELLO

  .local int a, b, c
  (a, b, c) = foo()

=head2 Heredoc

NOTE: the heredoc rules are not complete or tested. Some work is required
here.


  heredoc:
    "<<" string_constant nl
    heredoc_string
    heredoc_label

  heredoc_label:
    ^^ identifier

  heredoc_string:
    [ \N | \n ]*

=head3 Example Heredoc

  .local string str
  str = <<'ENDOFSTRING'
    this text
    is stored in the variable
    named 'str'. Whitespace and newlines
    are stored as well.
  ENDOFSTRING

Note that the Heredoc identifier should be at the beginning of the line; no
whitespace in front of it is allowed. Printing C<str> would print:

    this text
    is stored in the variable
    named 'str'. Whitespace and newlines
    are stored as well.

In IMCC, a heredoc identifier can be specified as an argument, like this:

  foo(42, "hello", <<'EOS')
  This is a heredoc text argument.
  EOS

In IMCC, only B<one> such argument can be specified. The languages/PIR
implementation aims to allow for B<any> number of heredoc arguments, like
this:

  foo(<<'STR1', <<'STR2')
  argument 1
  STR1
  argument 2
  STR2

B

=head2 Invoking subroutines and methods

  sub_invocation:
    long_sub_call | short_sub_call

  long_sub_call:
    ".begin_call" nl
    arguments
    [ method_call | non_method_call ] nl
    [ local_decl nl ]*
    result_values
    ".end_call"

  non_method_call:
    [ ".call" | ".nci_call" ] target

  method_call:
    ".invocant" target nl
    ".meth_call" [ target | string_constant ]

  parenthesized_args:
    "(" args? ")"

  args:
    arg [ "," arg ]*

  arg:
    [ float_constant
    | int_constant
    | string_constant [ "=>" target ]?
    | target
    ]
    set_flags?

  arguments:
    [ ".arg" simple_expr set_flags? nl ]*

  result_values:
    [ ".result" target get_flags? nl ]*

  set_flags:
    [ ":flat"
    | named_flag
    ]+

=head3 Example long subroutine call

The long subroutine call syntax is very suitable to be generated by a language
compiler targeting Parrot. Its syntax is rather verbose, but easy to read.
The minimal invocation looks like this:

  .begin_call
  .call $P0
  .end_call

Invoking instance methods is a simple variation:

  .begin_call
  .invocant $P0
  .meth_call $P1
  .end_call

Passing arguments and retrieving return values is done like this:

  .begin_call
  .arg 42
  .call $P0
  .local int res
  .result res
  .end_call

Arguments can take flags as well. The following argument flags are defined:

=over 4

=item *

  :flat

Flatten the (aggregate) argument. This argument can only be of type C<pmc>.

=item *

  :named('x')

Pass the denoted argument into the named parameter that is denoted by 'x',
like so:

  .param int myX :named('x') # the type 'int' is just an example

As was mentioned in the parameter declaration section, the C<:named> flag can
be used on an aggregate value in combination with the C<:flat> flag.

  .arg myArgs :flat :named

=back

  .local pmc arr
  arr = new .Array
  arr = 2
  arr[0] = 42
  arr[1] = 43
  .begin_call
  .arg arr :flat
  .arg $I0 :named('intArg')
  .call foo
  .end_call

The Native Calling Interface (NCI) allows for calling C routines, in order to
talk to the world outside of Parrot. Its syntax is a slight variation; it uses
C<.nci_call> instead of C<.call>.

  .begin_call
  .nci_call $P0
  .end_call

=head2 Short subroutine invocation

  short_sub_call:
    invocant? [ target | string_constant ] parenthesized_args

  invocant:
    target"."

=head3 Example short subroutine call

The short subroutine call syntax is useful when manually writing PIR code. Its
simplest form is:

  foo()

Or a method call:

  obj.'toString'() # call the method 'toString'
  obj.x() # call the method whose name is stored in 'x'.

Note that no spaces are allowed between the invocant and the dot;
C<"obj . 'toString'"> is not valid, because it will be interpreted as a
concatenation.

Using the short version, passing arguments can be done as well, including all
flags that were defined for the long version.
The same example from the 'long subroutine invocation' is now shown in its
short version:

  .local pmc arr
  arr = new .Array
  arr = 2
  arr[0] = 42
  arr[1] = 43
  foo(arr :flat, $I0 :named('intArg'))

In order to do a Native Call Interface invocation, the subroutine to be
invoked needs to be referenced from a PMC register, as its name is B<not>
visible from Parrot. An NCI call looks like this:

  .local pmc nci_sub, nci_lib
  .local string c_function, signature

  nci_lib = loadlib "myLib"

  # name of the C function to be called
  c_function = "sayHello"

  # set signature to "void" (no arguments)
  signature = "v"

  # get a PMC representing the C function
  nci_sub = dlfunc nci_lib, c_function, signature

  # and invoke
  nci_sub()

=head2 Return values from subroutines

  return_stat:
      long_return_stat
    | short_return_stat
    | long_yield_stat
    | short_yield_stat
    | tail_call

  long_return_stat:
    ".begin_return" nl
    return_directive*
    ".end_return"

  return_directive:
    ".return" simple_expr set_flags? nl

=head3 Example long return statement

Returning values from a subroutine is in fact similar to passing arguments
I<to> a subroutine. Therefore, the same flags can be used:

  .begin_return
  .return 42 :named('answer')
  .return $P0 :flat
  .end_return

In this example, the value C<42> is passed as the named return value
C<'answer'>. The aggregate value in C<$P0> is flattened, and each of its
values is passed as a return value.

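
To see a long return statement in context, the following sketch pairs it with
a caller that receives the values through a result list, as described in the
section on assignments. The sub names C<main> and C<calc> and all variable
names are hypothetical and only serve the illustration.

  .sub main :main
    .local int sum, diff
    # receive both return values at once
    (sum, diff) = calc(5, 3)
    print sum
    print "\n"
    print diff
    print "\n"
  .end

  .sub calc
    .param int a
    .param int b
    $I0 = a + b
    $I1 = a - b
    # return the values in order: first the sum, then the difference
    .begin_return
    .return $I0
    .return $I1
    .end_return
  .end

With the arguments shown, C<sum> should receive 8 and C<diff> should receive
2, since return values are matched to the result list by position.
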

=head2 Short return statement

  short_return_stat:
    ".return" parenthesized_args

=head3 Example short return statement

  .return(myVar, "hello", 2.76, 3.14)

Just as the return values in the long return statement could take flags, those
in the short return statement may as well:

  .return(42 :named('answer'), $P0 :flat)

=head2 Long yield statements

  long_yield_stat:
    ".begin_yield" nl
    return_directive*
    ".end_yield"

=head3 Example long yield statement

A C<yield> statement works the same as a normal return statement, except that
the point where the subroutine was left is stored, so that the subroutine can
be resumed from that point as soon as the subroutine is invoked again.
Returning values is identical to normal return statements.

  .sub foo
    .begin_yield
    .return 42
    .end_yield

    # and later in the sub, one could return another value:

    .begin_yield
    .return 43
    .end_yield
  .end

  # when invoking twice:
  foo() # returns 42
  foo() # returns 43

=head2 Short yield statements

  short_yield_stat:
    ".yield" parenthesized_args

=head3 Example short yield statement

The short yield statement has the same form as the short return statement.

  .yield("hello", 42)

=head2 Tail calls

  tail_call:
    ".return" short_sub_call

=head3 Example tail call

  .return foo()

Returns the return values from C<foo>. This is implemented by a tail call,
which is more efficient than:

  .local pmc results
  results = foo()
  .return(results)

The call to C<foo> can be considered a normal function call with respect to
parameters: it can take the exact same format using argument flags. The tail
call can also be a method call, like so:

  .return obj.'foo'()

=head2 Expansions

  expansion:
      macro_def
    | include
    | pasm_constant

  include:
    ".include" string_constant

  pasm_constant:
    ".constant" identifier [ constant_value | register ]

=head2 Macros

  macro_def:
    ".macro" identifier macro_parameters? nl
    macro_body

  macro_parameters:
    "(" id_list? ")"

  macro_body:
    *
    ".endm" nl

  macro_invocation:
    macro_id parenthesized_args?

Note that before a macro body is parsed, some grammar rules are changed.
In a macro body, local variable declarations are done using the
C<.macro_local> directive. B. The C<.label> directive is available for
declaring unique labels.

  macro_label:
    ".macrolabel" "$"identifier":"

=head3 Example Macros

When the following macro is defined:

  .macro add2(n)
    inc .n
    inc .n
  .endm

then one can write in a subroutine:

  .sub foo
    .local int myNum
    myNum = 42
    .add2(myNum)
    print myNum # prints 44
  .end

=head2 PIR Pragmas

  pragma:
      new_operators
    | loadlib
    | namespace
    | hll_mapping
    | hll_specifier
    | source_info

  new_operators:
    ".pragma" "n_operators" int_constant

  loadlib:
    ".loadlib" string_constant

  namespace:
    ".namespace" [ "[" namespace_id "]" ]?

  hll_specifier:
    ".HLL" string_constant "," string_constant

  hll_mapping:
    ".HLL_map" string_constant "," string_constant

  namespace_id:
    string_constant [ ";" string_constant ]*

  source_info:
    ".line" int_constant [ "," string_constant ]?

  id_list:
    identifier [ "," identifier ]*

=head3 Examples pragmas

  .include "myLib.pir"

includes the source from the file "myLib.pir" at the point of this directive.

  .pragma n_operators 1

makes Parrot automatically create new PMCs when using arithmetic operators,
like:

  $P1 = new 'Integer'
  $P2 = new 'Integer'
  $P1 = 42
  $P2 = 43
  $P0 = $P1 * $P2
  # now, $P0 is automatically assigned a newly created PMC.

  .line 100
  .line 100, "myfile.pir"

NOTE: currently, the line directive is implemented in IMCC as #line. See the
PROPOSALS document for more information on this.

  .namespace ['Foo'] # namespace Foo
  .namespace ['Object';'Foo'] # nested namespace
  .namespace # no [ id ] means the root namespace is activated

The first line opens the namespace 'Foo'. When doing Object Oriented
programming, this would indicate that sub or method definitions belong to the
class 'Foo'. Of course, you can also define namespaces without doing
OO-programming.

Please note that this C<.namespace> directive is I<different> from the
C<.namespace> directive that is used within subroutines.


  .HLL "Lua", "lua_group"

is an example of specifying the High Level Language (HLL) for which the PIR is
being generated. It is a shortcut for setting the namespace to 'Lua', and for
loading the PMCs in the lua_group library.

  .HLL_map "Integer", "LuaNumber"

tells Parrot that whenever an Integer would be created somewhere in the system
(C code), a LuaNumber object is created instead.

  .loadlib "myLib"

is a shortcut for telling Parrot that the library "myLib" should be loaded
when running the program. In fact, it is a shortcut for:

  .sub _load :load :anon
    loadlib "myLib"
  .end

TODO: check flags and syntax for this.

=head2 Tokens, types and targets

  string_constant:
    [ encoding_specifier? charset_specifier ]? quoted_string

  encoding_specifier:
    "utf8:"

  charset_specifier:
      "ascii:"
    | "binary:"
    | "unicode:"
    | "iso-8859-1:"

  type:
      "int"
    | "num"
    | "pmc"
    | "string"

  target:
    identifier | register

=head3 Notes on Tokens, types and targets

A string constant can be written like:

  "Hello world"

but if desirable, the character set can be specified:

  unicode:"Hello world"

When using the "unicode" character set, one can also specify an encoding
specifier; currently only C<utf8> is allowed:

  utf8:unicode:"hello world"

IMCC currently allows identifiers to be used as types. During the parse, the
identifier is checked to see whether it is a defined class. The built-in types
int, num, pmc and string are always available.

A C<target> is something that can be assigned to; it is an L-value (but of
course may be read just like an R-value). It is either an identifier or a
register.

=head1 AUTHOR

Klaas-Jan Stol [parrotcode@gmail.com]

=head1 KNOWN ISSUES AND BUGS

Some work should be done on:

=over 4

=item *

Heredoc parsing

=item *

Test. A lot.

Bugs or improvements may be sent to the author, and are of course greatly
appreciated. Moreover, if you find any missing constructs that are in IMCC,
indications of these would be appreciated as well.
Please see the PROPOSALS document for some proposals of the author to clean up
the official grammar of PIR (as defined by the IMCC compiler).

=back

=head1 REFERENCES

=over 4

=item *

languages/PIR/lib/pir.pg - The actual PIR grammar implementation

=item *

PDD03 - Parrot Calling Conventions

=item *

PDD20 - Lexically scoped variables in Parrot

=item *

docs/pdds/draft/pdd19_pir.pod

=back

=head1 CHANGES

0.3.1

=over 4

=item *

Remove .namespace for scopes

=item *

Some clean-ups

=back

0.3.0

=over 4

=item *

Remove C<.pcc_> prefix on PCC directives

=item *

Remove C<.emit> and C<.eom> directives.

=back

0.2.0

=over 4

=item *

Many clean ups; remove experimental C<:wrap> flag, remove C<.global>
directive, remove C<.sym> directive, add C<.label> directive for macros,
remove C<.sub>; remove some comments that are not true any more. In all, it's
getting much cleaner!

=back

0.1.4

=over 4

=item *

Added C rule, moved C and C rules to that rule. Added C definition.

=item *

Removed newlines in operator definition to save some lines for readability.

=back

0.1.3

=over 4

=item *

Updated short sub invocation for NCI invocations.

=item *

Added an example for C<.globalconst>.

=item *

Added some remarks at section for Macros.

=item *

Added some remarks here and there, and fixed some style issues.

=back

0.1.2

=over 4

=item *

Removed C<.immediate>, it is C<:immediate>, and thus not a PIR directive, but
a flag. This was a mistake.

=item *

Added C<.globalconst>

=item *

Added macro parsing example (it is now fixed in languages/PIR).

=item *

Added reference to official doc for IMCC syntax.

=item *

Added C<:unique_reg> to allowed flags for incoming parameters.

=back

0.1.1

=over 4

=item *

Switch to x.y.z version number; many fixes will follow.

=item *

Added more examples.

=item *

Fixed some errors.

=back

0.1

=over 4

=item *

Initial version having a version number.

=back

=cut