A BRIEF TUTORIAL ON USING ISEARCH
---------------------------------

This document is work in progress (8/11/95). Send comments to Nassib
Nassar (nrn@cnidr.org).

Isearch is included as part of Isite; however, Isearch is updated more
frequently than Isite, and it is a good idea to use the latest version
of Isearch instead of the version that comes with Isite.

The latest stable version of Isearch is available from ftp.cnidr.org in
/pub/software/Isearch/. Binaries for some systems may be available in
that same directory.

Following are the usage details for the Iindex and Isearch utilities.
You can view them at any time by simply typing `Iindex' and `Isearch'
with no options.

Iindex [-d (X)]       // Use (X) as the root name for database files.
       [-a]           // Add to existing database, instead of replacing it.
       [-m (X)]       // Load (X) megabytes of data at a time for indexing
                      // (default=1).
       [-s (X)]       // Treat (X) as a separator for multiple documents within
                      // a single file.
       [-t (X)]       // Index as files of document type (X).
       [-f (X)]       // Read list of file names to be indexed from file (X).
       [-r]           // Recursively descend subdirectories.
       [-o (X)]       // Document type specific option.
       (X) (Y) (...)  // Index files (X), (Y), etc.

Isearch [-d (X)]       // Search database with root name (X).
        [-p (X)]       // Present element set (X) with results.
        [-q]           // Print results and exit immediately.
        [-and]         // Perform boolean "and" on results.
        [-byterange]   // Print the byte range of each document within
                       // the file that contains it.
        [-o (X)]       // Document type specific option.
        (X) (Y) (...)  // Search for words (X), (Y), etc.
                       // [fieldname/]searchterm[*][:n]
                       // Prefix with fieldname/ for fielded searching.
                       // Append * for right truncation.
                       // Append :n for term weighting (default=1).
                       // (Use negative values to lower rank.)


SUBSCRIBING TO THE MAILING LIST

There is a mailing list for discussion of Isite and Isearch.
You can subscribe to the list by sending an e-mail message to

     listproc@kudzu.cnidr.org

with the body of the message being

     subscribe ISITE-L FirstName LastName

Be sure to use your name in place of `FirstName LastName'.

To post messages to the list, send them to

     ISITE-L@kudzu.cnidr.org

If you have any questions about this list, please contact
nrn@cnidr.org.


INSTALLING ISEARCH

In order to compile Isearch, you must have GNU C/C++ properly
installed. Both gcc and the g++ library must be installed:

     URL:ftp://sunsite.unc.edu/pub/gnu/development/gcc-2.6.3.tar.gz
     URL:ftp://sunsite.unc.edu/pub/gnu/libraries/libg++-2.6.2.tar.gz

To install Isearch, simply gunzip and untar, cd into the Isearch
directory, and type

     make

to compile. After this step (or if you downloaded precompiled binaries)
you will probably want to install the executables somewhere. To install
the executables in `/usr/local/bin/.', type (as root)

     make install

Or, for example, to install into `/mylocal/bin/.', type

     make install INSTALL=/mylocal/bin

Isearch should compile cleanly and without modification on most UNIX
platforms.


INDEXING AND SEARCHING A SIMPLE DATABASE (-d)

The simplest way to use these utilities is to index and search a
collection of text files (e.g., `*.txt') with the following commands:

     (1) Iindex -d /db/MYDB /data/*.txt
     (2) Isearch -d /db/MYDB computer

In the first example, Iindex will read all `*.txt' files in the /data/
directory and create several database files in the /db/ directory
beginning with `MYDB.'. In the second example, Isearch will search the
database files that Iindex created and will display all documents that
contain the word `computer'.

Note that if you move or rename any of the text files you have indexed,
you must then re-index (by running Iindex again).


INDEXING AND SEARCHING FIELDED DATA (-t)

Isearch supports field-based searching via `document types'. A document
type is a self-contained C++ class that defines how to index a certain
class of fielded documents.
At present, the only document type distributed with Isearch is the
SGMLTAG document type. SGMLTAG is able to parse SGML-like files such as
HTML, and it treats tag pairs as field delimiters.

For example, to index HTML files using the SGMLTAG document type,

     Iindex -d WEBPAGES -t SGMLTAG *.html

The `-t' option tells Iindex to use the SGMLTAG document type, that is,
to treat the files it indexes as SGML-like tagged data.

For example, to search the WEBPAGES database for the word `music' in
the `title' field,

     Isearch -d WEBPAGES title/music

Isearch will display all the indexed documents that contain the word
`music' between `title' and `/title' tags.

Note that field names and search terms are not case-sensitive, which
means that all of the following would behave identically:

     Isearch -d WEBPAGES title/Music
     Isearch -d WEBPAGES Title/MUSIC
     Isearch -d WEBPAGES TITLE/music

Document type names are also not case-sensitive, so `-t sgmltag' is
equivalent to `-t SGMLTAG'.


INDEXING FILES CONTAINING MULTIPLE DOCUMENTS (-s)

Iindex allows you to have many documents per file. Typically each
document is separated from the others by a `separator string'. The
separator might be a line of dashes (e.g., `-----') or any other string
you choose. The `-s' option lets you tell Iindex what the separator
string is in the files you are indexing. For example,

     Iindex -d ARTICLES -s "###" *.txt

will treat every occurrence of the string `###' as the boundary between
two documents within the same `.txt' file. For example, a file
containing

     sample text 1 abcde
     ###
     sample text 2 fghij
     ###
     sample text 3 klmno

would be interpreted by Iindex as three separate documents, the second
and third beginning with `###'.

It is not always necessary to use the quotation marks when specifying
the separator string after `-s', but it is a good habit, since some
characters you might use in the separator may have a special meaning to
the system shell (e.g., `>').

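The separator rule can be illustrated with a short Python sketch. This
is not the Iindex implementation; the function name `split_documents'
is hypothetical, and the handling of edge cases (such as a file that
begins with the separator) is an assumption made for illustration. It
only shows the stated behavior: each occurrence of the separator starts
a new document, and every document after the first begins with the
separator itself.

```python
def split_documents(text, separator):
    """Split text into documents at each occurrence of separator.

    Hypothetical helper: every document after the first begins with
    the separator string, as described for the `-s' option.
    """
    if not separator:
        return [text]
    starts = [0]
    pos = text.find(separator)
    while pos != -1:
        # A separator at offset 0 does not open an extra, empty document
        # (an assumption; Iindex may treat this case differently).
        if pos != 0:
            starts.append(pos)
        pos = text.find(separator, pos + len(separator))
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


sample = (
    "sample text 1 abcde\n"
    "###\n"
    "sample text 2 fghij\n"
    "###\n"
    "sample text 3 klmno\n"
)

docs = split_documents(sample, "###")
for number, doc in enumerate(docs, start=1):
    print(number, repr(doc))
```

Joining the returned pieces reproduces the original text, so no bytes
are lost or duplicated at the document boundaries.
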

SPEEDING UP INDEXING BY USING MORE MEMORY (-m)

By default, Iindex reads and indexes 1 MB of text at a time. If the
text exceeds that amount, Iindex must save parts of the index to disk
and merge each part into the final index. If you have enough memory on
your system, you can tell Iindex to read more than 1 MB of text at a
time. For example,

     Iindex -d MYDB -m 3 *.txt

will read and index the text in 3 MB blocks. The ideal case is to set
the `-m' option to a value larger than the amount of text you have to
index. (So, if you are indexing 1500 KB of text, use `-m 2'.) However,
in most cases there is not enough memory to hold the entire collection.

Iindex may actually use much more memory than what you specify with
`-m', depending on a wide range of data-specific variables. The `-m'
option specifies the amount of data to load, not the total amount of
memory to use.


INDEXING A LIST OF FILES (-f)

Iindex lets you specify a list of files that you want to index. For
example,

     Iindex -d MYDB -f filelist

will expect the file `filelist' to be a text listing of file names that
refer to the files you want to index.


INDEXING FILES RECURSIVELY IN SUBDIRECTORIES (-r)

Iindex can recursively descend subdirectories as it searches for files
to index. For example,

     Iindex -d MYDB -r ~/doc/

will index all files located below the directory `~/doc/'.

Note that when using `-r' the file specifications at the end of the
line are taken to be directories that should be searched recursively.
Specifying a mask with `-r', for example,

     Iindex -d WEBPAGES -t SGMLTAG -r /webdata/*.html    (Incorrect!)

is incorrect unless you actually have directory names ending in
`.html'! The correct way to recursively index your web site is

     find /webdata/ -name "*.html" -print | Iindex -d WEBPAGES -t SGMLTAG -f -
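
If `find' is not convenient, the same kind of file list can be built
with a short script and passed to Iindex with `-f'. The following is a
minimal Python sketch, not part of Isearch; the function names
`list_files' and `write_file_list' are hypothetical. It writes one file
name per line, which matches what `find ... -print' produces.

```python
import fnmatch
import os


def list_files(root, pattern):
    """Return sorted paths of files below root whose names match pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-sensitive match on the file name only, like `find -name'.
            if fnmatch.fnmatchcase(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def write_file_list(root, pattern, list_path):
    """Write matching paths to list_path, one per line; return the count."""
    paths = list_files(root, pattern)
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)
```

For example, after calling write_file_list("/webdata", "*.html",
"filelist"), the resulting file could be indexed with

     Iindex -d WEBPAGES -t SGMLTAG -f filelist
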