Mird, low-level database library * Introduction * Installation + Supported platforms * License * Workflow + Opening, syncing and closing the database + Transactions + Writing to the database + Reading from the database + Reading a whole table + Errors and exceptions + Threads * Implementation + Low-level blocks, superblocks and the free block list + Block cache system + Fragblocks + Data cells + The hashtrie + String key tables + Master table + Journal file and transactions + Conflicts + Freeing blocks + Cleaning dirty databases * Author Introduction Mird is a database library, for operating on simple disk-based databases with the following features: * The database is key-value oriented, primarily - this is useful for most database operations, from inserting and removing lines in tables, and for object-oriented data, provided it's fast enough. (It should be.) The key can be either a 32-bit integer or a byte string of any size. (A integer-key table using 32 bit hash keys is used to store the string-key entries.) * The database consists of several tables or key-value mappings. On this level, the table is identified by a 32-bit integer. It is easy enough to implement a table name lookup table, when the need arises. * Changes to the database is made in the shell of transactions, where each transaction encapsulates any change to the database atomically. * More then one transaction could be in progress at any given moment, with resulting conflicts at the closing of the last transactions, if they operate on the same data. * Conflicts are either based on changes to a key:value pair, but a whole table could be locked to mark dependency conflicts. * The syncing against disk is normally done first when the database is closed or forced to sync; this is when any unused block is reused. This gives very high speed. (The database could of course be in a state where every transaction completion is synced to the disk.) * To make this non-syncing secure, the database uses a journal file. When something fatal happened to the process using the database (power loss, kill -9, whatever), this file is read back and the database is restored to the latest state possible. * The transaction order is guaranteed; any transaction following any failed (ie, broken, not cancelled) transaction will never be seen as valid. * The database is fast; it is normally no problem to write or read several thousand entries per second, if the I/O wait is slow enough. * The database is unfortunately low-level, there is no high-level replicating, memory-only or distributing schemes included. It was never inteded to solve these problems, only the speed and security. * The library is heavily tested by a testsuite, including tests simulating power loss and write failures (as in writing the wrong data to the database). Installation Unpack the archive using tar: % tar xvzf mird-1.0.tar.gz Configure for your system: % ./configure creating cache ./config.cache configuring in /users/mirar/mird/build/your_machine/ creating cache ./config.cache checking for gcc... gcc [...] creating Makefile % Make and install the library: % make install building in /users/mirar/mird/build/mistel.idonex.se/ make: Entering directory `/export/home/mirar/mird/build/mistel.idonex. se' gcc -g -O2 -g -Wall -Wno-parentheses -I. -I../../src -c ../../src/blocks. c -o blocks.o [...] installing mird.h in /usr/local/include ln -sf libmird.so.1.0 /usr/local/lib/libmird.so % Now the library is installed. If you want it installed in some other path, run the configure with % ./configure --prefix=path where path is the base path of installation (the prefix to .../lib and .../include). You can also run % ./configure --help to read about the other options to configure. Supported platforms The library is configured by using autoconf. Most Unix- or Posix-like platforms should be supported, since the library itself does very little system-specific operations. It has been tested to compile and run on the following systems: * Linux RedHat 4.2 (x86) * Solaris 7 (x86, sparc) If you succeed to run it in any other environment, please report the success to the author. The database file generated does not have any machine dependent data, which means that the database itself can be transferred between systems with different endian. Everything is stored in network byte order. License The Mird database library may be freely distributed and used with these limits¹: * The database library may be distributed with or without source. * There is no promise that this software works. (But if you find any bugs, please let me know!) * You can use this software for whatever you want. You don't have to pay me. * You may not pretend that you wrote this software. If you use it in a program, you must acknowledge it somewhere in your documentation. In legalese: The authors make NO WARRANTY or representation, either express or implied, with respect to this software, its quality, accuracy, merchantability, or fitness for a particular purpose. This software is provided "AS IS", and you, its user, assume the entire risk as to its quality and accuracy. This software is copyright (C) 1999-2000, Mirar Hagland All Rights Reserved except as specified below. Permission is hereby granted to use, copy, modify, and distribute this software (or portions thereof) for any purpose, without fee, subject to these conditions: (1) If any part of the source code for this software is distributed, then this license must be included, with this copyright and no-warranty notice unaltered; and any additions, deletions, or changes to the original files must be clearly indicated in accompanying documentation. (2) If only executable code is distributed, then the accompanying documentation must state that "this software is utilizing the Mird database". (3) Permission for use of this software is granted only if the user accepts full responsibility for any undesirable consequences; the authors accept NO LIABILITY for damages of any kind. These conditions apply to any software derived from or based on the Mird database code, not just to the unmodified library. If you use this work, you ought to acknowledge it. I permit and encourage the use of this software as the basis of commercial products, provided that all warranty or liability claims are assumed by the product vendor. The Unix configuration script "configure" was produced with GNU Autoconf. It is copyright by the Free Software Foundation but is freely distributable. The same holds for its supporting scripts (config.guess, config.sub, ltconfig, ltmain.sh). Another support script, install-sh, is copyright by M.I.T. but is also freely distributable. ¹ this legalese is stolen from the JPEG license Workflow Opening, syncing and closing the database Opening the database consists of two steps, first to create the database object structure, then to create, open or restore1 the database file. * initialize the structure, ie call mird_initialize, * change the parameters in the structure, see struct mird, * open the database, ie call mird_open. ... * operate on the database * while opened, the database and journal file may be synced by calling mird_sync or mird_sync_please. ... * close the database and free the structure by calling mird_close. Example: MIRD_RES res; struct mird *my_database; /* initialize the structure */ if ( (res=mird_initialize("my_database_file.mird",&my_database)) ) [...do something with the error...] /* change the default parameters if necessary */ my_database->flags|=MIRD_NOCREATE; my_database->cache_size=1024; /* open the database file */ if ( (res=mird_open(my_database)) ) [...do something with the error...] [...] /* upon need, */ /* sync the database and flush the journal file */ if ( (res=mird_sync(my_database)) ) [...do something with the error...] [...] /* close the database */ if ( (res=mird_close(my_database)) ) [...do something with the error...] Please note that you neither need to change or setup the database parameters, or sync the database. On a long-lived application (a server, for instance) the mird_sync operation may help to keep down the size of the files, though, since both the database and journal file only may grow up to the point where the database is synced or closed. At that point, the database starts to reuse any free space generated from the previous operations and the journal file is restarted. Also note that the close or sync operations shouldn't be too often. The operations to find free space are summarily quicker if there is more data to work on at the same time. 1) A database file can be in three states, non-existant, clean or dirty. It is dirty when a previous process has not closed the database nicely, but rather died in some way. In this case, the journal file is still there, and the database is restored to the lastest state possible. Transactions Any change to the database must be in the scope of a transaction. To change anything in the database, you must therefore first create a transaction, in which you do the change to the database, and then close it: * open the transaction, ie call mird_transaction_new ... * operate on the transaction, to do the changes ... * close or cancel the transaction, ie call mird_transaction_close or mird_transaction_cancel The result of a transaction close can be a conflict. In this case, the transaction must be cancelled. Example: struct mird_transaction *mtr; /* begin a new transaction */ if ( (res=mird_transaction_new(my_database,&mtr)) ) [...do something with the error...] [...operate on the transacton ...] /* close the transaction */ if ( (res=mird_transaction_close(mtr)) ) { if (!MIRD_IS_CONFLICT(res)) [...do something with the error...] [...do something with the conflict...] mird_free_error(res); if ( (res=mird_transaction_cancel(mtr)) ) [...do something with the error...] } Several transactions can be going on at the same time. If you doesn't have more then one transactions at the same time, you can never get conflicts. The transaction scope can be used to read the database, too. It keeps the scope the database had until it's closed (or synced against the current state in any other way). Note that the database cannot be synced when there is ongoing transactions. Writing to the database Within a transaction, changes to the database is possible. The database is table-oriented, so the first thing you will have to do with a new database, is to create the tables you need. A table may either be a mapping from an integer (32-bit) key to an any-length string, or from a string to a string. To create a table, call mird_key_new_table or mird_s_key_new_table: /* create table 17, integer key to string value table */ if ( (res=mird_key_new_table(mtr,17)) ) [...error...] /* create table 18, string key to string value table */ if ( (res=mird_s_key_new_table(mtr,18)) ) [...error...] Note that if the table pre-exists, this call fails and the resulting exception will be MIRD_TABLE_EXISTS. When the tables exists, you will want to write down key:value pairs in them. This is done by calling mird_key_store, or for string key tables, mird_s_key_store: /* store an integer key : string value in the table 17 */ mird_key_t int_key; unsigned char *value; mird_size_t value_len; if ( (res=mird_key_store(mtr,17,int_key,value,value_len)) ) [...error...] /* store a string key : string value in the table 18 */ unsigned char *string_key; mird_size_t key_len; unsigned char *value; mird_size_t value_len; if ( (res=mird_s_key_store(mtr,18,string_key,key_len,value,value_len )) ) [...error...] Any key can be written any number of times in a transaction. Conflicts appear at first when the transaction is closed. Reading from the database Reading from the database is done in a similar way as writing to the database, by calling mird_key_lookup or mird_s_key_lookup, for integer and string keys, respectively. When the data read is no longer used, it must be freed by calling mird_free. unsigned char *value; mird_size_t value_len; /* read from an integer key table (table 17) */ if ( (res=mird_key_lookup(my_database,17,int_key,&value,&value_len)) ) [...error...] if (!value) [...action if there was no value from that key...] /* free the resulting value */ if (value) mird_free(value); [...] /* read from a string key table (table 18) */ if ( (res=mird_s_key_lookup(my_database,18,string_key,key_len,&value ,&value_len)) ) [...error...] if (!value) [...action if there was no value from that key...] /* free the resulting value */ if (value) mird_free(value); Note that if the key doesn't exist in the table, the resulting value pointer will be NULL. If the table does not exist, or is the wrong type, this will result in the MIRDE_NO_TABLE respectively MIRDE_WRONG_TABLE exceptions. Note also that mird_free may not fail. If you need to read from the database in one and the same state, you can create a transaction to read in. This will keep the same state as the database were when the transaction was created, and may be read from by using mird_transaction_key_lookup and mird_transaction_s_key_lookup. Reading a whole table Errors and exceptions On some occations, the database functions will return an exception. If the return value is NULL, everything went fine. Otherwise, you may wish to handle this error in one way of other. To do this correctly, you may look at the elements in the strucure, and/or use the functions and macros available to examine an error; mird_describe_error, mird_perror and the convinience macro MIRD_IS_CONFLICT. The returned type MIRD_RES is a pointer to a structure containing the following elements: int error_no; char *s; unsigned x,y,z; error_no contains the error number the exception had, see the error list in the reference manual. s, x, y and z contains different information depending on the errors nature (from filenames and block numbers to table identifier and keys in conflicts). This is also described in the reference manual. mird_describe_error and mird_perror converts the error to something a little bit more human-readable. An error handling situation can look like this: /* do something */ if ( (res=mird_operation([...])) ) { /* check if it is something we expected */ if (res->error_no==MIRD_NO_TABLE) { [ ... do something about that ... ] mird_free_error(res); } else { /* it isn't something we expected, */ /* print the error and quit immediately */ mird_perror("my_program (doing some operation)",res); exit(1); } } If you get an exception, but continues to run the application, you must always free the exception. This is done by calling mird_free_error (which may not fail). Threads The database library is not thread safe. Almost any operation on the database must lock the whole database, so this is the solution I recommend to use in a threaded application. The library is thread safe if you avoid working on the same database structure, so you can have more then one database open at the same time and open on both of them. (There is no global variables in the library itself.) Implementation This chapter is for you out there that really wants to know how this library works, and is by all means neccesary to use the library - although it could be enlightening. The database engine is divided up in several distinct areas, from low-level storage to handling transactions and tables: +---------------------------------------------------+ | low-level block storage | |------------------------------------------|--------| | block cache system | | | |------------| | | | free block | super | |-------------------------| |-|----| list | blocks | | frag block system | | | | | |------------------|------| |------|--| |--| | | hashtries | data cells | | | | |--------------| |------------------|------------------| |--| | | journal file | | tables | | | | | | | |----------------------------------| |----| |------|--------| | | | transactions | | | | |--|-----------------------------------------|-|---------------|-----| | database | |--------------------------------------------------------------------| Low-level blocks, superblocks and the free block list The database is based on blocks. A block can be in four states, it can be * database system block, ie, a free list block or a superblock * used block, either a full cell block or a frag block * unused block * block beyond the tail of the database file All blocks are of constant size, per default 2048 bytes. This makes the database less sensitive to fragmentation. The superblocks contain information of the database, like where to find information about the tables, what the block size are, and other various neccesary knowledge to read a database. Every 4^n-1 block is a superblock (block 0, block 3, block 15, ...). They also contain information to tell if the database is clean or in dirty mode - the later means that the journal file must be read back at database open. The free list blocks contains lists of free blocks, that can be reused. The superblock has information on how big the file is, so if the datpabase runs out of blocks to reuse, it makes the file bigger instead. The free list is never appended, until the database is closed or synced. No block can be reused until the database is synchronized with the disk. Unfortunately, this makes the database grow up to this point - but beyond this point, the unused space is reused. Block cache system Directly below the low-level block storage, there is a block cache system; various other levels use this system as both read- and write cache. The cache system is a closed hash table of blocks, that has been read to memory. When a block is needed, it first checks the cache for the block, and if it isn't in the cache, it disposes of a block in the cache, and reuses it's space. What block is disposed of is random. When a block is written to, it is also read to the cache, or allocated in the cache if it's a new block, the cache data pointer is returned, and the block is marked as dirty. A dirty block that has to be reused in the cache is saved to disk before reuse. Later, when a transaction is completed, the transactions blocks still marked dirty in the cache is flushed to disk. Fragblocks For small segments of data, a complete block isn't used. More then one small piece of data can be stored in a special block, called fragblock. The fragblock has a table of contents at the start. Any piece of data has an address to either a fragblock or a block of it's own. Some bits are used to address the block itself, and a few bits are used to address the data in a a fragblock, per default 27 respective 5 bits. If the later bits are zero, the address is a block of its own, but any other value is pointing to an address in a frag block. This gives 2^n-1 possible data pieces in a frag block, per default 2^5-1 = 31 pieces. Data cells A piece of data can either fit in a fragblock or a big block, or not fit at all. In this case, the data cell takes up more then one block (or one or some block and a piece of a fragblock). These blocks is ordered in a linked list. The allocation system is supposed to make these blocks allocated in the right order, if possible, so that readback will be forward, allowing read-cache optimization in the operative system to work more efficient. The hashtrie The integer-key lookup table is a heavily branching tree - I think the data structures is called hashtrie. The first node branches off on the lowest bits of the hash key, the next node branches off on the second lowest bits, etc. In the end, the leaf is always a data cell. The cell itself contains the hash key it's supposed to map from, so a full tree is never neccesary. top node | +----+----+-- ... --+ bits value 0 1 2 n-1 | | | | node node node node | +----+----+-- ... --+ 0 1 2 n-1 | | | | node node node node | leaf with data cell (in this example the key has 2*n+1 as the lowest bits) This structure makes it possible to have a huge hashtable (or integer key table, as it is in reality), without having to rewrite the whole table for any minor change. In a write to the table, the database only writes down the nodes that has change, and the new nodes neccesary to branch the tree. It also makes a rather fast way of reading back the keys, which will be in logarithmic time. Per default there is 32 branches per node, which gives a write or readback time of approximately c + d·log32 n, where n is the number of keys in the table, and c and d is constant. This is efficient enough in about any case. Example: We want to find the key 4711 in a database that is using 5 bits per hashtrie node (32 branches). * First, we find the top node. It's a hashtrie, so we figure out what branch we will take here. 4711&(2^5-1) is 7, so that is the branch we will take. * This branches off to another node, so we will have to figure out what branch we will take again. Since we worked off 5 bits already, this time the branch number will be (4711>>5)&(2^5-1), 147&31, branch number 19. * That branch points to a data cell, so we start reading this cell. It tells us the key that the cell contains is 4711, which we were looking for, so we have found our data. If the cell key had been different to 4711, or any branch didn't lead anywhere, we would know our key wasn't in this table. Writing the key is a bit more obstacled, since we can't use any other transactions already existing nodes if we have to write in them. (See below.) For each node we have to change, we have to allocate some data space and copy the old node there, before writing to it - if we didn't create or write in the node already. This of course makes the node directly above this node affected to change, so the whole chain back to the top node has to be rewritten. This again makes it possible to control the whole database by knowing just the top node, and that makes it possible to cancel a transaction, which is neccesary for atomicity and isolation between transactions, which is quite useful. String key tables String key tables, which can be quite more useful then the normal integer key or hash key table, is handled through a special case of this hashtrie. The string key is converted to a hash key, which is used in a normal hashtrie table, but the key is stored in the data cell, and it is allowed to keep more then one key:value pair in the data cell. This is neccesary; of course more then one string can be converted to the same hash key (there are more then 2^32 strings possible, the string can be any length), but not very often - that is the main function of the string-to-hash-key conversion, so the database is still very efficient. Master table The database can contain more then one of these tables, and that is controlled by another special case of the above described hashtrie, called the master table. The only difference in this table is that the leafs of this table are not data, but table descriptors. The table descriptor tells us where the root of the table are, and what kind of table it is. Journal file and transactions Conflicts Freeing blocks Cleaning dirty databases Author The author is Mirar. I can be contacted either on or at . I'm a nice guy, please praise me for my work and report bugs so the database can be even more stable. Mirds homepage is http://www.mirar.org/mird/, where any new release can be downloaded.