Mird, low-level database library
                                       
     * Introduction
     * Installation
          + Supported platforms
     * License
     * Workflow
          + Opening, syncing and closing the database
          + Transactions
          + Writing to the database
          + Reading from the database
          + Reading a whole table
          + Errors and exceptions
          + Threads
     * Implementation
          + Low-level blocks, superblocks and the free block list
          + Block cache system
          + Fragblocks
          + Data cells
          + The hashtrie
          + String key tables
          + Master table
          + Journal file and transactions
          + Conflicts
          + Freeing blocks
          + Cleaning dirty databases
     * Author
       
Introduction

   Mird is a database library, for operating on simple disk-based
   databases with the following features:
     * The database is key-value oriented, primarily - this is useful for
       most database operations, from inserting and removing lines in
       tables, and for object-oriented data, provided it's fast enough.
       (It should be.) The key can be either a 32-bit integer or a byte
       string of any size. (A integer-key table using 32 bit hash keys is
       used to store the string-key entries.)
     * The database consists of several tables or key-value mappings.
       On this level, the table is identified by a 32-bit integer. It is
       easy enough to implement a table name lookup table, when the need
       arises.
     * Changes to the database is made in the shell of transactions,
       where each transaction encapsulates any change to the database
       atomically.
     * More then one transaction could be in progress at any given
       moment, with resulting conflicts at the closing of the last
       transactions, if they operate on the same data.
     * Conflicts are either based on changes to a key:value pair, but a
       whole table could be locked to mark dependency conflicts.
     * The syncing against disk is normally done first when the database
       is closed or forced to sync; this is when any unused block is
       reused. This gives very high speed.
       (The database could of course be in a state where every
       transaction completion is synced to the disk.)
     * To make this non-syncing secure, the database uses a journal file.
       When something fatal happened to the process using the database
       (power loss, kill -9, whatever), this file is read back and the
       database is restored to the latest state possible.
     * The transaction order is guaranteed; any transaction following any
       failed (ie, broken, not cancelled) transaction will never be seen
       as valid.
     * The database is fast; it is normally no problem to write or read
       several thousand entries per second, if the I/O wait is slow
       enough.
     * The database is unfortunately low-level, there is no high-level
       replicating, memory-only or distributing schemes included. It was
       never inteded to solve these problems, only the speed and
       security.
     * The library is heavily tested by a testsuite, including tests
       simulating power loss and write failures (as in writing the wrong
       data to the database).
       
Installation

   Unpack the archive using tar:
      % tar xvzf mird-1.0.tar.gz

   Configure for your system:
      % ./configure
      creating cache ./config.cache
      configuring in /users/mirar/mird/build/your_machine/
      creating cache ./config.cache
      checking for gcc... gcc
      [...]
      creating Makefile
      %

   Make and install the library:
      % make install
      building in /users/mirar/mird/build/mistel.idonex.se/
      make: Entering directory `/export/home/mirar/mird/build/mistel.idonex.
se'
      gcc -g -O2 -g -Wall -Wno-parentheses -I. -I../../src -c ../../src/blocks.
c -o blocks.o
      [...]
      installing mird.h in /usr/local/include
      ln -sf libmird.so.1.0 /usr/local/lib/libmird.so
      %

   Now the library is installed.
   
   If you want it installed in some other path, run the configure with
      % ./configure --prefix=path
   where path is the base path of installation (the prefix to .../lib and
   .../include). You can also run
      % ./configure --help
   to read about the other options to configure.
   
  Supported platforms
  
   The library is configured by using autoconf. Most Unix- or Posix-like
   platforms should be supported, since the library itself does very
   little system-specific operations. It has been tested to compile and
   run on the following systems:
     * Linux RedHat 4.2 (x86)
     * Solaris 7 (x86, sparc)
       
   If you succeed to run it in any other environment, please report the
   success to the author.
   
   The database file generated does not have any machine dependent data,
   which means that the database itself can be transferred between
   systems with different endian. Everything is stored in network byte
   order.
   
License

   The Mird database library may be freely distributed and used with
   these limitsš:
     * The database library may be distributed with or without source.
     * There is no promise that this software works. (But if you find any
       bugs, please let me know!)
     * You can use this software for whatever you want. You don't have to
       pay me.
     * You may not pretend that you wrote this software. If you use it in
       a program, you must acknowledge it somewhere in your
       documentation.
       
   In legalese:
   
   The authors make NO WARRANTY or representation, either express or
   implied, with respect to this software, its quality, accuracy,
   merchantability, or fitness for a particular purpose. This software is
   provided "AS IS", and you, its user, assume the entire risk as to its
   quality and accuracy.
   
   This software is copyright (C) 1999-2000, Mirar Hagland All Rights
   Reserved except as specified below.
   
   Permission is hereby granted to use, copy, modify, and distribute this
   software (or portions thereof) for any purpose, without fee, subject
   to these conditions:
   (1) If any part of the source code for this software is distributed,
   then this license must be included, with this copyright and
   no-warranty notice unaltered; and any additions, deletions, or changes
   to the original files must be clearly indicated in accompanying
   documentation.
   (2) If only executable code is distributed, then the accompanying
   documentation must state that "this software is utilizing the Mird
   database".
   (3) Permission for use of this software is granted only if the user
   accepts full responsibility for any undesirable consequences; the
   authors accept NO LIABILITY for damages of any kind.
   
   These conditions apply to any software derived from or based on the
   Mird database code, not just to the unmodified library. If you use
   this work, you ought to acknowledge it.
   
   I permit and encourage the use of this software as the basis of
   commercial products, provided that all warranty or liability claims
   are assumed by the product vendor.
   
   The Unix configuration script "configure" was produced with GNU
   Autoconf. It is copyright by the Free Software Foundation but is
   freely distributable. The same holds for its supporting scripts
   (config.guess, config.sub, ltconfig, ltmain.sh). Another support
   script, install-sh, is copyright by M.I.T. but is also freely
   distributable.
   
   š this legalese is stolen from the JPEG license
   
Workflow

  Opening, syncing and closing the database
  
   Opening the database consists of two steps, first to create the
   database object structure, then to create, open or restore1 the
   database file.
     * initialize the structure, ie call mird_initialize,
     * change the parameters in the structure, see struct mird,
     * open the database, ie call mird_open.
       ...
     * operate on the database
     * while opened, the database and journal file may be synced by
       calling mird_sync or mird_sync_please.
       ...
     * close the database and free the structure by calling
       mird_close.
       
   Example:
   

      MIRD_RES res;
      struct mird *my_database;

      /* initialize the structure */
      if ( (res=mird_initialize("my_database_file.mird",&my_database)) )
         [...do something with the error...]

      /* change the default parameters if necessary */
      my_database->flags|=MIRD_NOCREATE;
      my_database->cache_size=1024;

      /* open the database file */
      if ( (res=mird_open(my_database)) )
         [...do something with the error...]

      [...]

      /* upon need,                                   */
      /* sync the database and flush the journal file */
      if ( (res=mird_sync(my_database)) )
         [...do something with the error...]

      [...]

      /* close the database */
      if ( (res=mird_close(my_database)) )
         [...do something with the error...]

   Please note that you neither need to change or setup the database
   parameters, or sync the database. On a long-lived application (a
   server, for instance) the mird_sync operation may help to keep
   down the size of the files, though, since both the database and
   journal file only may grow up to the point where the database is
   synced or closed. At that point, the database starts to reuse any free
   space generated from the previous operations and the journal file is
   restarted.
   
   Also note that the close or sync operations shouldn't be too often.
   The operations to find free space are summarily quicker if there is
   more data to work on at the same time.
   
     1) A database file can be in three states, non-existant, clean or
     dirty. It is dirty when a previous process has not closed the
     database nicely, but rather died in some way. In this case, the
     journal file is still there, and the database is restored to the
     lastest state possible.
     

  Transactions
  
   Any change to the database must be in the scope of a transaction. To
   change anything in the database, you must therefore first create a
   transaction, in which you do the change to the database, and then
   close it:
     * open the transaction, ie call mird_transaction_new
       ...
     * operate on the transaction, to do the changes
       ...
     * close or cancel the transaction, ie call
       mird_transaction_close or mird_transaction_cancel
       
   The result of a transaction close can be a conflict. In this case, the
   transaction must be cancelled.
   
   Example:
   
      struct mird_transaction *mtr;

      /* begin a new transaction */
      if ( (res=mird_transaction_new(my_database,&mtr)) )
         [...do something with the error...]

      [...operate on the transacton ...]

      /* close the transaction */
      if ( (res=mird_transaction_close(mtr)) )
      {
         if (!MIRD_IS_CONFLICT(res))
              [...do something with the error...]
         [...do something with the conflict...]
         mird_free_error(res);
         if ( (res=mird_transaction_cancel(mtr)) )
            [...do something with the error...]
      }

   Several transactions can be going on at the same time. If you doesn't
   have more then one transactions at the same time, you can never get
   conflicts.
   
   The transaction scope can be used to read the database, too. It keeps
   the scope the database had until it's closed (or synced against
   the current state in any other way).
   
   Note that the database cannot be synced when there is ongoing
   transactions.
   
  Writing to the database
  
   Within a transaction, changes to the database is possible.
   
   The database is table-oriented, so the first thing you will have to do
   with a new database, is to create the tables you need.
   A table may either be a mapping from an integer (32-bit) key to an
   any-length string, or from a string to a string.
   To create a table, call mird_key_new_table or
   mird_s_key_new_table:
   
       /* create table 17, integer key to string value table */
       if ( (res=mird_key_new_table(mtr,17)) )
          [...error...]

       /* create table 18, string key to string value table */
       if ( (res=mird_s_key_new_table(mtr,18)) )
          [...error...]

   Note that if the table pre-exists, this call fails and the resulting
   exception will be MIRD_TABLE_EXISTS.
   
   When the tables exists, you will want to write down key:value pairs in
   them. This is done by calling mird_key_store, or for string key
   tables, mird_s_key_store:
   
       /* store an integer key : string value in the table 17 */
       mird_key_t     int_key;
       unsigned char *value;
       mird_size_t    value_len;

       if ( (res=mird_key_store(mtr,17,int_key,value,value_len)) )
          [...error...]

       /* store a string key : string value in the table 18 */
       unsigned char *string_key;
       mird_size_t    key_len;
       unsigned char *value;
       mird_size_t    value_len;
       if ( (res=mird_s_key_store(mtr,18,string_key,key_len,value,value_len
)) )
          [...error...]

   Any key can be written any number of times in a transaction. Conflicts
   appear at first when the transaction is closed.
   
  Reading from the database
  
   Reading from the database is done in a similar way as writing to the
   database, by calling mird_key_lookup or mird_s_key_lookup, for
   integer and string keys, respectively.
   When the data read is no longer used, it must be freed by calling
   mird_free.
   
       unsigned char *value;
       mird_size_t    value_len;

       /* read from an integer key table (table 17) */
       if ( (res=mird_key_lookup(my_database,17,int_key,&value,&value_len))
 )
          [...error...]

       if (!value)
          [...action if there was no value from that key...]

       /* free the resulting value */
       if (value) mird_free(value);

       [...]

       /* read from a string key table (table 18) */
       if ( (res=mird_s_key_lookup(my_database,18,string_key,key_len,&value
,&value_len)) )
          [...error...]

       if (!value)
          [...action if there was no value from that key...]

       /* free the resulting value */
       if (value) mird_free(value);

   Note that if the key doesn't exist in the table, the resulting value
   pointer will be NULL. If the table does not exist, or is the wrong
   type, this will result in the MIRDE_NO_TABLE respectively
   MIRDE_WRONG_TABLE exceptions.
   
   Note also that mird_free may not fail.
   
   If you need to read from the database in one and the same state, you
   can create a transaction to read in. This will keep the same state as
   the database were when the transaction was created, and may be read
   from by using mird_transaction_key_lookup and
   mird_transaction_s_key_lookup.
   
  Reading a whole table
  
  Errors and exceptions
  
   On some occations, the database functions will return an exception. If
   the return value is NULL, everything went fine.
   
   Otherwise, you may wish to handle this error in one way of other. To
   do this correctly, you may look at the elements in the strucure,
   and/or use the functions and macros available to examine an error;
   mird_describe_error, mird_perror and the convinience macro
   MIRD_IS_CONFLICT.
   
   The returned type MIRD_RES is a pointer to a structure containing the
   following elements:
    int error_no;
    char *s;
    unsigned x,y,z;

   error_no contains the error number the exception had, see the
   error list in the reference manual. s, x, y and z contains different
   information depending on the errors nature (from filenames and block
   numbers to table identifier and keys in conflicts). This is also
   described in the reference manual.
   
   mird_describe_error and mird_perror converts the error to
   something a little bit more human-readable.
   
   An error handling situation can look like this:
      /* do something */
      if ( (res=mird_operation([...])) )
      {
         /* check if it is something we expected */
         if (res->error_no==MIRD_NO_TABLE)
         {
            [ ... do something about that ... ]
            mird_free_error(res);
         }
         else
         {
            /* it isn't something we expected,         */
            /* print the error and quit immediately    */

            mird_perror("my_program (doing some operation)",res);
            exit(1);
         }
      }

   If you get an exception, but continues to run the application, you
   must always free the exception. This is done by calling
   mird_free_error (which may not fail).
   
  Threads
  
   The database library is not thread safe. Almost any operation on the
   database must lock the whole database, so this is the solution I
   recommend to use in a threaded application. The library is thread safe
   if you avoid working on the same database structure, so you can have
   more then one database open at the same time and open on both of them.
   (There is no global variables in the library itself.)
   
Implementation

   This chapter is for you out there that really wants to know how this
   library works, and is by all means neccesary to use the library -
   although it could be enlightening. The database engine is divided up
   in several distinct areas, from low-level storage to handling
   transactions and tables:
   +---------------------------------------------------+
   |             low-level block storage               |
   |------------------------------------------|--------|
   |           block cache system             |        |
   |                             |------------|        |
   |                             | free block | super  |
   |-------------------------| |-|----| list  | blocks |
   |  frag block system      | |      |       |        |
   |------------------|------| |------|--| |--|        |
   |     hashtries    | data cells       | |  |        | |--------------|
   |------------------|------------------| |--|        | | journal file |
   |            tables                   | |  |        | |              |
   |  |----------------------------------| |----| |------|--------|     |
   |  |       transactions                      | |               |     |
   |--|-----------------------------------------|-|---------------|-----|
   |            database                                                |
   |--------------------------------------------------------------------|

  Low-level blocks, superblocks and the free block list
  
   The database is based on blocks. A block can be in four states, it can
   be
     * database system block, ie, a free list block or a superblock
     * used block, either a full cell block or a frag block
     * unused block
     * block beyond the tail of the database file
       
   All blocks are of constant size, per default 2048 bytes. This makes
   the database less sensitive to fragmentation.
   
   The superblocks contain information of the database, like where to
   find information about the tables, what the block size are, and other
   various neccesary knowledge to read a database. Every 4^n-1 block is a
   superblock (block 0, block 3, block 15, ...). They also contain
   information to tell if the database is clean or in dirty mode - the
   later means that the journal file must be read back at database open.
   
   The free list blocks contains lists of free blocks, that can be
   reused. The superblock has information on how big the file is, so if
   the datpabase runs out of blocks to reuse, it makes the file bigger
   instead.
   
   The free list is never appended, until the database is closed or
   synced. No block can be reused until the database is synchronized with
   the disk. Unfortunately, this makes the database grow up to this point
   - but beyond this point, the unused space is reused.
   
  Block cache system
  
   Directly below the low-level block storage, there is a block cache
   system; various other levels use this system as both read- and write
   cache.
   
   The cache system is a closed hash table of blocks, that has been read
   to memory. When a block is needed, it first checks the cache for the
   block, and if it isn't in the cache, it disposes of a block in the
   cache, and reuses it's space. What block is disposed of is random.
   
   When a block is written to, it is also read to the cache, or allocated
   in the cache if it's a new block, the cache data pointer is returned,
   and the block is marked as dirty. A dirty block that has to be reused
   in the cache is saved to disk before reuse.
   
   Later, when a transaction is completed, the transactions blocks still
   marked dirty in the cache is flushed to disk.
   
  Fragblocks
  
   For small segments of data, a complete block isn't used. More then one
   small piece of data can be stored in a special block, called
   fragblock. The fragblock has a table of contents at the start.
   
   Any piece of data has an address to either a fragblock or a block of
   it's own. Some bits are used to address the block itself, and a few
   bits are used to address the data in a a fragblock, per default 27
   respective 5 bits. If the later bits are zero, the address is a block
   of its own, but any other value is pointing to an address in a frag
   block. This gives 2^n-1 possible data pieces in a frag block, per
   default 2^5-1 = 31 pieces.
   
  Data cells
  
   A piece of data can either fit in a fragblock or a big block, or not
   fit at all. In this case, the data cell takes up more then one block
   (or one or some block and a piece of a fragblock). These blocks is
   ordered in a linked list. The allocation system is supposed to make
   these blocks allocated in the right order, if possible, so that
   readback will be forward, allowing read-cache optimization in the
   operative system to work more efficient.
   
  The hashtrie
  
   The integer-key lookup table is a heavily branching tree - I think the
   data structures is called hashtrie. The first node branches off on the
   lowest bits of the hash key, the next node branches off on the second
   lowest bits, etc. In the end, the leaf is always a data cell. The cell
   itself contains the hash key it's supposed to map from, so a full tree
   is never neccesary.
                         top node
                            |
                  +----+----+-- ... --+
     bits value   0    1    2        n-1
                  |    |    |         |
                 node node node      node
                       |
             +----+----+-- ... --+
             0    1    2        n-1
             |    |    |         |
            node node node      node
                       |
               leaf with data cell
              (in this example the key has
               2*n+1 as the lowest bits)

   This structure makes it possible to have a huge hashtable (or integer
   key table, as it is in reality), without having to rewrite the whole
   table for any minor change. In a write to the table, the database only
   writes down the nodes that has change, and the new nodes neccesary to
   branch the tree.
   
   It also makes a rather fast way of reading back the keys, which will
   be in logarithmic time. Per default there is 32 branches per node,
   which gives a write or readback time of approximately c + dˇlog32 n,
   where n is the number of keys in the table, and c and d is constant.
   This is efficient enough in about any case.
   
   Example: We want to find the key 4711 in a database that is using 5
   bits per hashtrie node (32 branches).
     * First, we find the top node. It's a hashtrie, so we figure out
       what branch we will take here. 4711&(2^5-1) is 7, so that is the
       branch we will take.
     * This branches off to another node, so we will have to figure out
       what branch we will take again. Since we worked off 5 bits
       already, this time the branch number will be (4711>>5)&(2^5-1),
       147&31, branch number 19.
     * That branch points to a data cell, so we start reading this cell.
       It tells us the key that the cell contains is 4711, which we were
       looking for, so we have found our data.
       If the cell key had been different to 4711, or any branch didn't
       lead anywhere, we would know our key wasn't in this table.
       
   Writing the key is a bit more obstacled, since we can't use any other
   transactions already existing nodes if we have to write in them. (See
   below.)
   For each node we have to change, we have to allocate some data space
   and copy the old node there, before writing to it - if we didn't
   create or write in the node already. This of course makes the node
   directly above this node affected to change, so the whole chain back
   to the top node has to be rewritten.
   
   This again makes it possible to control the whole database by knowing
   just the top node, and that makes it possible to cancel a transaction,
   which is neccesary for atomicity and isolation between transactions,
   which is quite useful.
   
  String key tables
  
   String key tables, which can be quite more useful then the normal
   integer key or hash key table, is handled through a special case of
   this hashtrie. The string key is converted to a hash key, which is
   used in a normal hashtrie table, but the key is stored in the data
   cell, and it is allowed to keep more then one key:value pair in the
   data cell. This is neccesary; of course more then one string can be
   converted to the same hash key (there are more then 2^32 strings
   possible, the string can be any length), but not very often - that is
   the main function of the string-to-hash-key conversion, so the
   database is still very efficient.
   
  Master table
  
   The database can contain more then one of these tables, and that is
   controlled by another special case of the above described hashtrie,
   called the master table. The only difference in this table is that the
   leafs of this table are not data, but table descriptors. The table
   descriptor tells us where the root of the table are, and what kind of
   table it is.
   
  Journal file and transactions
  
  Conflicts
  
  Freeing blocks
  
  Cleaning dirty databases
  
Author

   The author is Mirar.
   
   I can be contacted either on <mirar@mirar.org> or at
   <mirar@lysator.liu.se>.
   I'm a nice guy, please praise me for my work and report bugs so the
   database can be even more stable.
   
   Mirds homepage is http://www.mirar.org/mird/, where any new
   release can be downloaded.