.Dd April 5, 2005
.Dt mm 1
.Sh NAME
.Nm MiniMunger
.Nd Language for writing text-processing filters
.Sh SYNOPSIS
.Nm mm.munger Ao source-file Ac
.Sh DESCRIPTION
MiniMunger is a properly tail-recursive, compilable subset of Munger(1) with first-class continuations, but without first-class lists, first-class symbols, local side-effects, macros, "eval", "extend", or runtime error-checking.
MM is specialized for, and limited to, writing filters.
This manual page describes only the differences between Munger and MM.
For more information, see the Munger manual page.
.Pp
Some example programs are included in the source distribution and installed in /usr/local/share/munger-4.143/.
.Bl -tag -width "transform.munger"
.It grep.mm
is an egrep-like filter.
.It fmt.mm
is a fmt-like filter.
.It options.mm
helps process command-line arguments.
.It stacks.mm
provides functions to apply functions to the elements of stacks.
.It cgi.mm
provides functions to simplify CGI scripts.
.El
.Sh IMPLEMENTATION NOTES
The MM compiler is a whole-program compiler: it reads one input file and produces two output files of C code, which may then be compiled by the C compiler and linked with the MM object files to produce an executable.
Instructions for using the compiler follow this section.
.Bl -bullet
.It
MM does not support lists or any list-related functions.
Programs are written as S-expressions, but programs may not create S-expressions.
The standard aggregate type of MM is the stack, a dynamically-resizable, one-dimensional array.
.It
Side-effects are only permissible on globals.
.It
The first-class data types supported are: stacks, tables, records, closures, continuations, compiled regular expressions, 8-bit-clean strings and fixnums.
A fixnum is one bit narrower than a C int on the hardware on which MM is running.
.It
To maximize execution speed, the MM runtime does no type-checking.
If you call an intrinsic with an argument of the wrong type, your program will crash.
.It
Although lambda-expressions can be bound to variables in "let" and "letn" forms, there is no "labels" or "letf" to allow temporary functions to see their own bindings.
Any function which calls itself must have a toplevel binding.
MM forces the programmer to break out all but the most trivial helper functions into separately-defined functions.
.It
There are no looping constructs.
All iteration is done via recursion.
CPS conversion is performed during compilation, turning all calls into tail calls.
Despite this, tail-recursion will be more space-efficient than recursion from non-tail positions, because the CPS-converted code of functions which recurse in non-tail positions will create closures to capture state.
The stack won't grow, but the heap will.
.It
First-class continuations are captured with the "call_cc" intrinsic, which behaves like "call/cc" does in Scheme.
.It
User-defined functions have fixed-size argument lists.
.It
User-defined macros are not supported.
.El
.Sh COMPILING MM PROGRAMS
The MM compiler is written in Munger, and compiles MM code to an intermediate language, defined as a set of macros in the source of the C runtime.
Most of the macros expand to in-line code for speed, resulting in larger executables than might be expected from the size of the original MM programs.
.Pp
To compile an MM program, the compiler must be invoked on the main source file.
The following invocation of MM will compile the grep.mm example program:
.Bd -literal -offset left
/usr/local/share/munger-4.143/mm.munger /usr/local/share/munger-4.143/grep.mm
.Ed
.Pp
The compiler will take some time to perform source-to-source conversions before it begins to emit code, printing status messages as it does so.
When it has finished, two files will have been created, one named "functions.c" and one named "functions.h".
To create an executable from these files, one must invoke the C compiler on them and the MM object files, specifying the location of the mm.h header with the -I compiler option:
.Bd -literal -offset left
cc -o grep functions.c -lminimunger -L/usr/local/lib -I/usr/local/include
.Ed
.Pp
The main source file of a program may include other source files with the "include" directive.
The "include" directive resembles its similarly-named C preprocessor counterpart, and consists of the word "include" preceded by an octothorpe (#) and followed by a double-quote-delimited filename.
For example:
.Bd -literal -offset left
#include "options.mm"
.Ed
.Pp
If the filename itself contains double quotes, they do not need to be escaped.
Include directives must start in column zero to be recognized.
Otherwise, they will be treated as comments.
Included files may themselves "include" other source files.
A compiler macro is defined to allow programs to know where the MM library files have been installed.
The following "include" statement will include "options.mm" from the MM library directory:
.Bd -literal -offset left
#include "%%LIBRARY_PATH%%/options.mm"
.Ed
.Pp
To determine the version number of the compiler, an argument of __VERSION__ may be passed to mm.
It will print the version number and exit.
This can be used in Makefiles to determine the correct path to the MiniMunger runtime and support files.
.Sh DEBUGGING MM PROGRAMS
To debug MM programs with gdb, you must create a debugging copy of the runtime.
You can build the debugging runtime by invoking:
.Bd -literal -offset left
make mmdebug
.Ed
.Pp
in the Munger source directory.
.Pp
The current compiler will not detect two types of errors.
The first type of error is the referencing of a variable before it has been initialized with "setq".
The reference will cause a global variable to be automatically created and initialized to a NULL pointer internally in your program.
Attempting to access the NULL pointer will crash your program, because the runtime engine, for maximum speed, performs no type-checking.
The compiler places comments in its output file functions.c beside each variable reference, naming the accessed variable, which will help the programmer find and fix errors using gdb.
The second type of error is the calling of an intrinsic function with arguments of the wrong type.
This error will also cause your program to crash.
.Pp
When debugging, the source displayed will consist mostly of the C macros which define the intermediate language output by the compiler.
The programmer may find it useful to run the C preprocessor over functions.c first, separately, to generate an expanded source file, and then compile that, in order to see the actual C being executed while tracing programs.
.Bd -literal -offset left
cc -E -o grep.c functions.c -I/usr/local/include
cc -o grep -ggdb grep.c -lminimunger -L/usr/local/lib -I/usr/local/include
.Ed
.Sh THE INTRINSICS
The MM intrinsics bear a strong resemblance to their similarly-named Munger counterparts.
Some behave differently.
Some accept a differing number of arguments.
Some accept differing types of arguments.
Some have different names.
The differences, in all cases, however, are minor.
This summary does not completely document the operation of the intrinsic functions, but merely lists which are available and how they differ from their Munger counterparts.
For complete documentation of an intrinsic, see the Munger(1) manual page.
.Ss Control Flow / Side-Effects
The empty string and 0 are boolean false values.
All other objects are considered boolean true values.
The forms below function identically to their Munger counterparts, with the exception of the conditionals.
Note that "setq" is the only means of accomplishing side-effects on variables, and that side-effects are only permissible upon globals.
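.Pp
The truth rules can be illustrated with a minimal sketch.
The globals "seen" and "label" are hypothetical names introduced only for this example, and the sketch assumes that a semicolon starts a comment, as it does in Munger:
.Bd -literal -offset left
; "seen" and "label" are hypothetical globals.
(setq seen 0)
(setq label "")

; Both 0 and the empty string are false, so the else clauses run.
(if seen (println "seen is true") (println "seen is false"))
(if label (println "label is true") (println "label is false"))

; Any other object is true, including a non-empty string.
(setq label "x")
(when label (println "label is now true"))
.Ed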
.Pp
When "if" is invoked with only a "true" subsequent clause, and the test condition evaluates to a false value, 0 is returned, and not the value of the failed test condition.
Similarly, if all test clauses of an invocation of "cond" fail, then 0 is returned, rather than the value of the last failed test condition.
Both "when" and "unless" also return 0 if their test conditions fail.
.Bl -column -offset left "unless" "(letn ((symbol expr)+) expr+)"
.It Sy Form Ta Sy Use
.It Li setq Ta (setq symbol expr)
.It Li if Ta (if test expr1 expr2 ...)
.It Li cond Ta (cond (test_expr subsequent ...)+ )
.It Li when Ta (when test expr ...)
.It Li unless Ta (unless test expr ...)
.It Li progn Ta (progn expr ...)
.It Li eq Ta (eq expr1 expr2)
.It Li or Ta (or expr ...)
.It Li and Ta (and expr ...)
.It Li not Ta (not expr)
.It Li let Ta (let ((symbol expr)+) expr+)
.It Li letn Ta (letn ((symbol expr)+) expr+)
.It Li exit Ta (exit expr)
.It Li die Ta (die ...)
.El
.Pp
call_cc is used to capture the current continuation.
It functions exactly as call/cc does in Scheme:
.Bl -column -offset left "call_cc" "(call_cc monadic_function)"
.It Li call_cc Ta (call_cc monadic_function)
.El
.Ss Regular Expressions
Note that "regexpp" in Munger is "regexp" in MM.
.Bl -column -offset left "substitute" "(substitute rx rep str count)" "0 or stack of 2 fixnums"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li regcomp Ta (regcomp str) Ta compiled rx
.It Li match Ta (match rx str) Ta 0 or stack of 2 fixnums
.It Li matches Ta (matches rx str) Ta stack of 20 strings
.It Li substitute Ta (substitute rx rep str cnt) Ta string
.It Li regexp Ta (regexp expr) Ta 0 or 1
.El
.Ss Tables
Note that the "hash" and "unhash" intrinsics of Munger become "associate" and "dissociate", and that both return the affected table.
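.Pp
Because both intrinsics return the table, calls can be nested.
A minimal sketch, assuming a hypothetical global named "colors" and that a semicolon starts a comment, as it does in Munger:
.Bd -literal -offset left
; "colors" is a hypothetical global used only for illustration.
(setq colors (table))

; Each associate returns the table, so the calls nest.
(associate (associate colors "sky" "blue") "grass" "green")

; dissociate also returns the table, which is handed on to lookup.
(println (lookup (dissociate colors "sky") "grass"))
.Ed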
.Bl -column -offset left "dissociate" "(associate table expr1 expr2)" "associated expr"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li table Ta (table) Ta new table
.It Li tablep Ta (tablep expr) Ta 0 or 1
.It Li associate Ta (associate table expr1 expr2) Ta table
.It Li dissociate Ta (dissociate table expr1) Ta table
.It Li lookup Ta (lookup table expr) Ta associated expr
.It Li keys Ta (keys table) Ta stack of keys
.It Li values Ta (values table) Ta stack of values
.El
.Ss Stacks
Note that the "unshift", "push", and "store" intrinsics all return the affected stack instead of their second arguments.
.Pp
The "exec_stack" and "join_stack" intrinsics both take a stack of strings as argument.
.Pp
"exec_stack" treats the first element as a program to be fed to the execvp() system call, and the remaining elements as the arguments to that program.
Note that the MM runtime will automatically ensure that the first element is also included in the argument list, in order to adhere to the UNIX convention that the first argument to a program be the name under which it was invoked.
.Pp
"join_stack" joins a stack of strings together, treating the first element of the stack as a delimiter to place between each of the other elements in the string being composed.
.Pp
The "append" intrinsic appends one or more stacks into a single stack.
The function creates a new stack and fills it with all the members of all its arguments, in order.
The "substack" intrinsic returns a contiguous subset of the elements of a stack, as a new stack.
The first argument must evaluate to a stack, while the second and third arguments must evaluate to numbers specifying the range of indices to be included in the substack.
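.Pp
The delimiter convention of "join_stack" can be sketched as follows.
The global "parts" is a hypothetical name, and the sketch assumes that a semicolon starts a comment, as it does in Munger:
.Bd -literal -offset left
; "parts" is a hypothetical global used only for illustration.
(setq parts (stack))

; push returns the stack, so the calls nest.
; The first element pushed, ", ", becomes the delimiter.
(push (push (push parts ", ") "alpha") "beta")

; Joins "alpha" and "beta" with ", " placed between them.
(println (join_stack parts))
.Ed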
.Bl -column -offset left "sort_numbers" "(substack stack expr expr)" "item at index expr"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li stack Ta (stack) Ta new stack
.It Li shift Ta (shift stack) Ta item at index 0
.It Li unshift Ta (unshift stack expr) Ta stack
.It Li push Ta (push stack expr) Ta stack
.It Li pop Ta (pop stack) Ta item at topidx
.It Li assign Ta (assign stack ...) Ta stack
.It Li append Ta (append stack ...) Ta new stack
.It Li substack Ta (substack stack expr expr) Ta new stack
.It Li index Ta (index stack expr) Ta item at index expr
.It Li store Ta (store stack fixnum expr) Ta stack
.It Li clear Ta (clear stack) Ta stack
.It Li used Ta (used stack) Ta stored item count
.It Li sort_numbers Ta (sort_numbers stack) Ta stack (sorted in situ)
.It Li sort_strings Ta (sort_strings stack) Ta stack (sorted in situ)
.It Li topidx Ta (topidx stack) Ta index of top item
.It Li exec_stack Ta (exec_stack stack) Ta does not return
.It Li join_stack Ta (join_stack stack) Ta string
.It Li stackp Ta (stackp expr) Ta 0 or 1
.El
.Ss Records
.Bl -column -offset left "Intrinsic" "(setfield expr1 expr2 expr3)" "new record of size n"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li record Ta (record n) Ta new record of size n
.It Li setfield Ta (setfield expr1 expr2 expr3) Ta expr3
.It Li getfield Ta (getfield expr1 expr2) Ta item in pos expr2
.El
.Ss Fixnums
Each of these functions accepts only two arguments, unlike its Munger counterpart.
Note that "=" is actually a synonym for "eq".
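.Pp
Operations on more than two operands must therefore be written as nested calls.
A minimal sketch, assuming that a semicolon starts a comment, as it does in Munger:
.Bd -literal -offset left
; Each arithmetic call takes exactly two arguments,
; so a three-way sum is written as two nested calls.
(println (+ (+ 1 2) 3))

; The comparisons are binary as well; "and" combines them.
(println (and (< 1 2) (< 2 3)))
.Ed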
.Bl -column -offset left "Intrinsic" "(>= expr1 expr2)" "absolute value"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li = Ta (= expr1 expr2) Ta 0 or 1
.It Li < Ta (< expr1 expr2) Ta 0 or 1
.It Li <= Ta (<= expr1 expr2) Ta 0 or 1
.It Li > Ta (> expr1 expr2) Ta 0 or 1
.It Li >= Ta (>= expr1 expr2) Ta 0 or 1
.It Li + Ta (+ expr1 expr2) Ta sum
.It Li - Ta (- expr1 expr2) Ta difference
.It Li * Ta (* expr1 expr2) Ta product
.It Li % Ta (% expr1 expr2) Ta remainder
.It Li / Ta (/ expr1 expr2) Ta quotient
.It Li abs Ta (abs expr) Ta absolute value
.El
.Pp
Note that "stringify" accepts only one argument, which must evaluate to a fixnum.
.Bl -column -offset left "stringify" "(stringify expr)" "string representation of expr"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li stringify Ta (stringify expr) Ta string representation of expr
.It Li numberp Ta (numberp expr) Ta 0 or 1
.It Li char Ta (char expr) Ta one-character string
.El
.Ss I/O
These are the general I/O functions.
"print" and "println" are compiler macros which insert code to call "display" for each of their arguments.
Note that both "getline" and "readchars" return the empty string, instead of 0, upon encountering EOF.
"display_error" and "newline_error" send their output to the standard error.
.Bl -column -offset left "display_error" "(display_error expr)" "value of last expr"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li print Ta (print expr ...) Ta value of last expr
.It Li println Ta (println expr ...) Ta 1
.It Li display Ta (display expr) Ta value of expr
.It Li display_error Ta (display_error expr) Ta value of expr
.It Li newline Ta (newline) Ta 1
.It Li newline_error Ta (newline_error) Ta 1
.It Li die Ta (die ...) Ta does not return
.It Li warn Ta (warn expr ...) Ta value of last expr
.It Li getline Ta (getline) Ta string
.It Li readchars Ta (readchars expr) Ta string
.El
.Pp
These intrinsics redirect the standard descriptors onto files and processes.
These functions return 1 upon success, or a string describing an error condition.
.Bl -column -offset left "with_output_file_appending" "(with_output_file_appending file expr ...)"
.It Sy Intrinsic Ta Sy Use
.It Li pipe Ta (pipe desc program)
.It Li with_input_process Ta (with_input_process program expr ...)
.It Li with_output_process Ta (with_output_process program expr ...)
.It Li redirect Ta (redirect desc file appending)
.It Li with_input_file Ta (with_input_file file expr ...)
.It Li with_output_file Ta (with_output_file file expr ...)
.It Li with_output_file_appending Ta (with_output_file_appending file expr ...)
.It Li resume Ta (resume desc)
.El
.Ss System-Related
"random" returns a fixnum in the range of 0 to one less than its argument.
"setenv" returns the value of the setenv system call, so 0 indicates success.
Because fixnums are 31 bits in width on a 32-bit machine, it is impossible to have the "time" intrinsic return a fixnum, so it returns a string padded with leading zeros until it occupies sixteen characters.
The "stat" intrinsic returns a five-element stack of strings: owner name or uid, group name or uid, time of last access, time of last modification, and size.
The last three values are all padded with leading zeros to become sixteen-character strings, so that they may be compared with "strcmp" to "time" values and each other, to determine which represents an earlier time.
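.Pp
A minimal sketch of such a comparison follows.
The filename "input.txt" and the global "info" are hypothetical, and the sketch assumes that "strcmp" follows the sign convention of strcmp(3) and that a semicolon starts a comment, as it does in Munger:
.Bd -literal -offset left
; "info" and "input.txt" are hypothetical names.
(setq info (stat "input.txt"))

(if (stackp info)
    ; Element 3 is the time of last modification.
    ; Assumes strcmp returns a negative fixnum when its first
    ; argument sorts before its second, as strcmp(3) does.
    (when (< (strcmp (index info 3) (time)) 0)
      (println "input.txt was modified before now"))
    ; Otherwise stat returned a string describing the error.
    (println info))
.Ed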
.Bl -column -offset left "directory" "(getenv str str)" "stack of filenames"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li basename Ta (basename path) Ta string
.It Li dirname Ta (dirname path) Ta string
.It Li directory Ta (directory expr) Ta stack of filenames
.It Li symlink Ta (symlink from to) Ta 0 or error string
.It Li rename Ta (rename from to) Ta 0 or error string
.It Li remove Ta (remove expr) Ta 0 or error string
.It Li stat Ta (stat expr) Ta stack or error string
.It Li setenv Ta (setenv str str) Ta fixnum
.It Li getenv Ta (getenv string) Ta string or 0
.It Li system Ta (system string) Ta 0 or error code
.It Li exec Ta (exec expr) Ta does not return
.It Li fork Ta (fork) Ta same as fork(2)
.It Li time Ta (time) Ta string
.It Li date Ta (date) Ta string
.It Li random Ta (random expr) Ta fixnum
.El
.Ss Command-Line Args
These function identically to their Munger counterparts.
.Bl -column -offset left "previous" "(previous)" "0 or string"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li next Ta (next) Ta 0 or string
.It Li previous Ta (previous) Ta 0 or string
.It Li current Ta (current) Ta string
.It Li rewind Ta (rewind) Ta string
.El
.Ss Strings
The ability of the "split" intrinsic in Munger to explode a string into a list of one-character strings is not present in the MM "split".
The "explode" intrinsic does this.
.Bl -column -offset left "expand_tabs" "(substring string expr1 expr2)" "stack of strings"
.It Sy Intrinsic Ta Sy Use Ta Sy Return Value
.It Li chop Ta (chop expr) Ta string
.It Li chomp Ta (chomp expr) Ta string
.It Li length Ta (length expr) Ta fixnum
.It Li digitize Ta (digitize expr) Ta fixnum
.It Li code Ta (code expr) Ta fixnum
.It Li explode Ta (explode expr) Ta stack of strings
.It Li stringp Ta (stringp expr) Ta 0 or 1
.It Li join Ta (join delim expr ...) Ta string
.It Li split Ta (split delims string) Ta stack of strings
.It Li concat Ta (concat expr1 expr2 ...) Ta string
.It Li substring Ta (substring string expr1 expr2) Ta string
.It Li strcmp Ta (strcmp expr1 expr2) Ta fixnum
.It Li expand_tabs Ta (expand_tabs expr1 string) Ta string
.El
.Sh AUTHORS
.An James Bailie Aq jimmy@mammothcheese.ca
.br
http://www.mammothcheese.ca