Object schemata
---------------

An object schema consists of the following components:

  * a set of atomic types, usually a subset of Python's builtin types.
    The default atomic types are string, int, long, float, and complex.
    In principle, you can add other builtin types (like function,
    class, or file) or extension types to a schema, but Grouch
    currently has problems with many builtin types. (In particular,
    only types whose values can be pickled may be atomic types in
    Grouch.)

  * a type alias mapping, letting you define shorthand names for
    common types.

  * a set of class definitions. A class definition maps instance
    attribute names to attribute types. This serves two purposes: it
    defines the expected set of attributes for instances of a class,
    and it defines the type of each attribute.

In the current version of Grouch, an object schema is defined through a
project description file and the class docstrings in a set of source
files. This is useful in practice, but it's kind of hard to talk about
object schemata without a simple, compact schema description language.
Thus, consider the following pseudo-schema:

  class Thing:
      name : string

  class Animal (Thing):
      num_legs : int
      furry : boolean

(Coincidentally, this is the syntax emitted by gen_schema's "-t" option.
However, this is currently a write-only language; Grouch has no way to
parse schemata created by "gen_schema -t".)

This defines an object schema with no additional atomic types (just the
default five: string, int, long, float, and complex), no aliases, and
two classes (both, presumably, in the __main__ module, since the class
names are unqualified).
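To make the three components concrete, here is a minimal sketch of the
pseudo-schema above as plain Python data. The names ATOMIC_TYPES,
ALIASES, CLASSES, and expected_attrs are made up for illustration and
are not Grouch's internal representation; the code is written for a
current Python 3 interpreter. It only shows how a class's full
attribute list can be assembled from its own definition plus those of
its base classes.

```python
# Illustrative model only: these names and structures are assumptions,
# not Grouch's real data structures.
ATOMIC_TYPES = {"string", "int", "long", "float", "complex"}
ALIASES = {}  # the pseudo-schema defines no aliases

CLASSES = {
    "Thing": {"bases": (), "attrs": {"name": "string"}},
    "Animal": {
        "bases": ("Thing",),
        "attrs": {"num_legs": "int", "furry": "boolean"},
    },
}


def expected_attrs(classes, class_name):
    """Return the attribute-to-type map for a class, inherited ones first."""
    merged = {}
    for base in classes[class_name]["bases"]:
        merged.update(expected_attrs(classes, base))
    merged.update(classes[class_name]["attrs"])
    return merged


print(expected_attrs(CLASSES, "Animal"))
# {'name': 'string', 'num_legs': 'int', 'furry': 'boolean'}
```

Keeping the per-class definitions separate from the merged view mirrors
the pseudo-schema, where Animal lists only the attributes it adds to
Thing.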
If you ask Grouch to type-check an instance of Thing under this schema,
or if it comes across a Thing instance in the course of type-checking a
larger object graph, it does the following:

  * ensure that the instance has exactly one attribute, 'name'
  * ensure that the value of this attribute is a string

Similarly, Grouch type-checks an Animal instance under this schema as
follows:

  * ensure that it has exactly three attributes, 'name', 'num_legs',
    and 'furry' (note that 'name' is inherited from Thing)
  * ensure that the value of 'name' is a string, 'num_legs' an int,
    and 'furry' a boolean (i.e. 0, 1, or None)


Defining an object schema: class docstrings
-------------------------------------------

Currently, you define an object schema by writing specially-formatted
class docstrings. (There is no separate schema description language...
yet.) For example, the Thing class in the above pseudo-schema might be
documented as:

  class Thing:
      """A single thing, which may be an animal, vegetable, or mineral.
      The only property common to all things is a name.

      Instance attributes:
        name : string
          the name of the thing
      """

Grouch (specifically, the gen_schema script that parses these
docstrings) ignores everything in the docstring up to the "Instance
attributes:" line. After that, things get fairly rigid:

  * the "Instance attributes:" line must be indented to the same depth
    as the main body of the docstring
  * each attribute name is indented two spaces relative to that, and
    followed by a colon (":") and the attribute's type
  * attribute descriptions (which are optional, and are ignored by
    Grouch) are indented a further two spaces
  * when indentation returns to the same level as the "Instance
    attributes:" line, Grouch stops processing the docstring and goes
    on to the next class in the module (thus, blank lines are allowed
    in the attribute list)

Here is a slightly more elaborate example:

  class Animal (Thing):
      """An animal, ie. a thing with multiple legs and possibly fur.

      Instance attributes:
        num_legs : int
          the number of legs this animal has
        furry : boolean
          whether this animal is furry or not

      Outsiders should use 'get_num_legs()' and 'is_furry()' to access
      these attributes.
      """

Here is a stripped-down version of this docstring that is exactly
equivalent as far as Grouch is concerned:

  class Animal (Thing):
      """
      Instance attributes:
        num_legs : int
        furry : boolean
      """

Sometimes a class will have no instance attributes of its own; Grouch
has special syntax for this:

  class Mammal (Animal):
      """Instance attributes: none"""

This is different from simply omitting the list of instance attributes,
or omitting the docstring entirely. If Grouch sees a Mammal instance
with any attributes apart from those inherited from Animal, it will
complain. However, if Mammal has no docstring or attribute list, Grouch
can't do detailed type-checking of instances of that class. Instead, it

  * complains that the class has no docstring (or no attribute list)
  * excludes the class from the schema
  * when type-checking an object graph, complains about any instances
    of that class it discovers


Defining an object schema: the project description file
-------------------------------------------------------

Writing class docstrings that document every instance attribute is the
key part of defining an object schema. However, you still have to tell
Grouch how to find those class docstrings and what to do with them.
This is done with the gen_schema script and its project description
file.

[Searching by directory]

At its simplest, the project description file contains a list of
directories to search for Python source files, and possibly a prefix to
use in turning source filenames into module names. Those directories
are interpreted relative to a base directory that you supply to
gen_schema.
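As a rough illustration of how a directory list, a prefix, and a base
directory can combine into module names, here is a sketch of my own. It
is not gen_schema's code: find_modules is a hypothetical helper, written
for Python 3. It looks only at "*.py" files directly inside each listed
directory, and it does not try to model how sub-directory names enter
the module name.

```python
import os
import tempfile


def find_modules(base_dir, dirs, prefix=None):
    """Map module names to paths for *.py files directly in each directory.

    Hypothetical helper: the module name is just the file stem, with
    the prefix (if any) put in front of it.
    """
    modules = {}
    for directory in dirs:
        full_dir = os.path.join(base_dir, directory)
        for filename in sorted(os.listdir(full_dir)):  # no recursion
            path = os.path.join(full_dir, filename)
            if filename.endswith(".py") and os.path.isfile(path):
                stem = filename[:-3]
                name = prefix + "." + stem if prefix else stem
                modules[name] = path
    return modules


# Demo on a throwaway directory standing in for the base directory.
with tempfile.TemporaryDirectory() as base:
    for name in ("schema.py", "valuetype.py", "README.txt"):
        open(os.path.join(base, name), "w").close()
    found = find_modules(base, ["."], prefix="grouch")

print(sorted(found))
# ['grouch.schema', 'grouch.valuetype']
```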
For example, the project description file for Grouch itself
(grouch.proj in the top-level Grouch directory) starts out with this:

  dirs = ["."]
  prefix = "grouch"

(The project description file is just Python code; it's execfile'd by
gen_schema.) This instructs gen_schema to search for *.py in the base
directory, and to assume that all modules found actually live in the
"grouch" package. Hence when it finds schema.py, it considers that
module to be "grouch.schema", and a class ObjectSchema in that file will
be called "grouch.schema.ObjectSchema".

gen_schema does *not* search recursively; if you want it to descend
into sub-directories, you must specify them explicitly:

  dirs = ["compiler", "compiler/parser", "compiler/optimizer"]

The directories in 'dirs' are interpreted relative to a base directory
supplied with the "-d" (or --base-dir) option to gen_schema. If you run
gen_schema from Grouch's top directory, you must specify "lib" as the
base directory, since that's where the Grouch source code (schema.py
and friends) lives:

  ./scripts/gen_schema -d lib -p grouch.proj

The resulting schema will be written (as a pickle) to schema.pkl; if you
want a human-readable representation of the schema, add "-t schema.txt"
to the command.

[Specifying individual modules]

If you don't want to search every "*.py" file in a list of directories,
you can supply a list of explicit module names, e.g.:

  extra_modules = ["grouch.schema", "grouch.valuetype"]

Note that extra_modules is a list of fully-qualified module names,
*not* filenames. This variable is called 'extra_modules' because these
modules are added to the list of modules found by searching the
directories named in 'dirs'. If 'dirs' isn't supplied, the modules in
'extra_modules' are Grouch's only source for class definitions.

[Excluding individual modules]

You can refine gen_schema's search for classes by excluding certain
modules.
As an example, Grouch includes a copy of SPARK (John Aycock's nifty
parser framework) as the "grouch.spark" module; since this is really
someone else's code, it doesn't have Grouch-style docstrings to parse.
Also, the parser classes are transient and shouldn't wind up in any
persistent store of a Grouch object graph, so there's not much point in
type-checking them. Thus, I exclude both grouch.spark and
grouch.type_parser (which provides classes derived from the SPARK
classes) from gen_schema's scan:

  exclude_modules = ["grouch.spark", "grouch.type_parser"]

Like extra_modules, exclude_modules is a list of fully-qualified module
names.

[Excluding individual classes]

You can also exclude specific classes from the search, instead of whole
modules. This is useful if a particular module provides some transient
classes and other first-class persistent classes. For example, I might
wish to exclude the TypecheckContext class, defined in grouch.context,
from schema generation:

  exclude_classes = ["grouch.context.TypecheckContext"]

Again, classes are specified as fully-qualified Python names.

[Adding atomic types]

If the five default atomic types aren't enough for your project, you'll
have to add new ones. This might happen if you use extension types in
your application, or if you store slightly odd objects in your
persistent object graph, like functions or class objects.

New atomic types are specified using an example value, not using the
type object itself. (This is necessary because type objects can't be
pickled, and gen_schema pickles the schema for future use. We can't
store type objects in the pickled schema, so we store sample values
instead.) For instance, to add Marc-André Lemburg's DateTime type to
your schema, add this to your project description file:

  import mx.DateTime
  atomic_types = [mx.DateTime.now()]

The structure of 'atomic_types' is a tad complex. Most often, each
element of the list is simply a value of the atomic type you want to add
to your schema -- e.g.
here I created a sample DateTime object. Since these sample values go
straight into the object schema, which is subsequently pickled by
gen_schema, these must be pickle-able values. Grouch probably needs to
grow a real schema definition language before you can have, say, Python
function or file objects as atomic types in an object schema. (In other
words, I think this is an implementation problem due to reliance on
pickling rather than a fundamental problem.)

In this simple case, the name of the atomic type is implicit, because
the type itself supplies its name -- "DateTime" in the above example.
(Try "type(mx.DateTime.now()).__name__".) In some cases, though, you
may want to specify your own name for an atomic type. In that case,
just supply a tuple (sample_value, type_name) in atomic_types. This is
useful if you're dealing with ExtensionClass, where every class is a new
type. (This is also the case with new-style classes in Python 2.2.)
For instance, a ZODB application that needs "class" and "instance" types
(for class objects and generic instance objects) might do this:

  import ZODB
  from Persistence import Persistent
  # ...
  atomic_types = [(Persistent(), "instance"), (Persistent, "class")]

If you don't understand why you might need this, you probably don't need
it.


Putting it all together
-----------------------

For a simple example of defining an object schema, take a look in the
"examples" sub-directory of Grouch's source distribution. There, you'll
find:

  * the thing.py and animal.py modules, which provide the classes
    ThingCollection, Thing, Animal, and Mammal
  * the make_things script, which creates some things, bundles them in
    a collection, and pickles them to things.pkl
  * the things.proj project description file, which tells gen_schema
    how to generate a schema for this project

For now, we're just going to generate a schema from the Python source
files and things.proj.
Later (in "checking.txt", the document that covers type-checking an
object graph) we'll run make_things and type-check the results.

If you haven't installed Grouch yet, you should either do so now or
perpetrate your favourite kludge for ensuring that it's available
through sys.path. (If you don't have a favourite kludge, just install
it.) Run

  python -c 'import grouch'

to make sure it worked -- if this command completes silently, all is
well.

Installing Grouch should also install the gen_schema and check_data
scripts. I'll assume they're in your shell's PATH; you might have to
adjust your PATH or the commands here accordingly.

Before we run gen_schema, let's take a look at the ingredients of this
project. First, the project description file, things.proj, is quite
simple:

  extra_modules = [("thing", "thing.py"), ("animal", "animal.py")]

There's no 'dirs' here, meaning gen_schema won't go searching for "*.py"
anywhere. It just looks for the 'thing' module in thing.py, and the
'animal' module in animal.py. Since explicit source filenames are
supplied, the 'thing' and 'animal' modules don't have to be in Python's
path -- gen_schema simply parses the source files.

Next, take a look at thing.py. You'll see that it defines two classes,
Thing and ThingCollection, and that the instance attributes of each are
fully documented. Similarly, animal.py provides the Animal and Mammal
classes.

Finally, let's run gen_schema. We'll save the schema for this project
to things_schema.pkl and things_schema.txt -- the two files have the
same content, but only the latter is human-readable. From the
"examples" directory, run this:

  gen_schema -p things.proj -o things_schema.pkl -t things_schema.txt

If you're really curious about what's going on here, add the "-v"
option. The output of gen_schema (without "-v") should look like this:

  looking for classes...
  found 4 classes
  parsing class docstrings...
  writing object schema to things_schema.txt...
  pickling object schema to things_schema.pkl...
Take a look at things_schema.txt for a human-readable representation of
the schema also saved in things_schema.pkl.

Now that we have an object schema for this project, we can use it later
to type-check a persistent object graph created by applications that use
this project, such as make_things. This will be done in the next
document, "checking.txt".

$Id: schema.txt 20229 2003-01-16 21:29:07Z akuchlin $