Lemur Beginner's Guide to Indexing
Contents
- What is an index?
- What kind of data/documents can Lemur index?
- Do the parsers add all words into the index?
- What type of indexes does Lemur have?
1. What is an index?
An index, or database, is basically a collection of information that can be quickly accessed, using some piece of information as a point of reference or key (what it's indexed by). In our case, we index information about the terms in a collection of documents, which you can access later using either a term or a document as the reference.
Specifically, we can collect term frequency, term position, and document length statistics, because those are most commonly needed for information retrieval. For example, from the index you can find out how many times a certain term occurred in the whole collection of documents, or how many times it occurred in just one specific document. Retrieval algorithms, which decide which documents to return for a given query, use the information collected in the index in their scoring calculations.
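To make these statistics concrete, here is a minimal sketch of a toy in-memory index in Python. It is not Lemur code and does not reflect how Lemur stores data on disk; the function names and the whitespace tokenization are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy positional index from a {doc_id: text} mapping.

    Illustrative only: terms are lowercased, whitespace-separated tokens.
    """
    # term -> doc_id -> list of positions where the term occurs
    postings = defaultdict(lambda: defaultdict(list))
    doc_lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()
        doc_lengths[doc_id] = len(terms)  # document length statistic
        for position, term in enumerate(terms):
            postings[term][doc_id].append(position)
    return postings, doc_lengths


def term_count_in_doc(postings, term, doc_id):
    """How many times a term occurred in one specific document."""
    return len(postings.get(term, {}).get(doc_id, []))


def term_count_in_collection(postings, term):
    """How many times a term occurred in the whole collection."""
    return sum(len(positions) for positions in postings.get(term, {}).values())


docs = {
    "d1": "the lemur indexes the documents",
    "d2": "a lemur is a primate",
}
postings, doc_lengths = build_index(docs)
```

With this sketch, the same structure can be accessed by term (`postings["lemur"]`) or by document (`doc_lengths["d1"]`), which is the "term or document as the reference" idea described above.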
2. What kind of data/documents can Lemur index?
You can write your own parser for whatever text documents you have, as long as the parser takes each item it wants to recognize as a term and "pushes" it into the index. The toolkit also provides several parsers.
Lemur is primarily a research system, so the included parsers were designed to facilitate indexing many documents that are stored in the same file. For the index to know where the document boundaries are within a file, each document must have begin-document and end-document tags. These tags are similar to HTML or XML tags and follow the format used for NIST's Text REtrieval Conference (TREC) documents.
The two most frequently used parsers are the TrecParser and WebParser.
TrecParser: This parser recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example:
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
WebParser: This parser removes HTML tags, text within SCRIPT tags, and text in HTML comments. Document boundaries are specified in the NIST-style format:
<DOC>
<DOCNO> document_number </DOCNO>
Document text here could be in HTML.
</DOC>
In addition to these parsers, Lemur also provides parsers for Chinese (GB2312 encoding) and Arabic (CP1256 encoding). (See "Parsing in Lemur" for more information.)
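As a rough illustration of how the document boundary tags shown above separate documents within one file, the following Python sketch splits a string into (document number, text) pairs. It is not the TrecParser or WebParser implementation: it only looks at the TEXT field, does not strip HTML, and the function name `split_documents` is hypothetical.

```python
import re

# Patterns for the NIST-style tags shown in the examples above.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def split_documents(data):
    """Yield (docno, text) pairs for each <DOC>...</DOC> block in data."""
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        docno_match = DOCNO_RE.search(body)
        if docno_match is None:
            continue  # skip blocks that have no document number
        docno = docno_match.group(1)
        text_parts = TEXT_RE.findall(body)
        if text_parts:
            # TrecParser-style layout: text lives inside TEXT fields
            text = " ".join(part.strip() for part in text_parts)
        else:
            # WebParser-style layout: everything after DOCNO is the text
            text = body[docno_match.end():].strip()
        yield docno, text


sample = """<DOC>
<DOCNO> doc-1 </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
<DOC>
<DOCNO> doc-2 </DOCNO>
Document text here could be in <b>HTML</b>.
</DOC>
"""
parsed = list(split_documents(sample))
```

A real parser would go on to tokenize each text into terms; this sketch stops at the boundary-detection step.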
If your documents are not from NIST, you can parse and index them in one of the following ways:
- Write a script to add the NIST-style tags around your documents. Then use one of the parsers provided by Lemur with either your own application or one of Lemur's applications.
- Write your own parser and feed the terms into an index by using the PushIndex API in your own application.
- Implement your own TextHandler class (a parser to handle your document formats), which you can then use in pipeline fashion with other TextHandlers already in Lemur to further pre-process terms (e.g. stopping, stemming) and use with InvFPTextHandler to build an index. (See "Parsing in Lemur" for more information.)
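For the first option, a tagging script can be very small. The Python sketch below wraps plain-text documents in the NIST-style tags shown earlier. The function names and the sample identifiers are hypothetical, and the sketch assumes the document text does not itself contain strings such as `</TEXT>` or `</DOC>`.

```python
def wrap_document(doc_id, text):
    """Wrap one plain-text document in NIST-style begin/end tags."""
    return (
        "<DOC>\n"
        f"<DOCNO> {doc_id} </DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def wrap_collection(documents):
    """Concatenate many (doc_id, text) pairs into one tagged string."""
    return "".join(wrap_document(doc_id, text) for doc_id, text in documents)


# Example: two short documents combined into a single tagged collection.
wrapped = wrap_collection([
    ("note-1", "First note.\n"),
    ("note-2", "  Second note."),
])
```

The resulting string can be written to a file and handed to one of the provided parsers, since many documents in one file is the layout they were designed for.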
Once your documents are ready to be indexed, you can index them using the BuildIndex or IndriBuildIndex applications. For the various parameters to be passed into the build index applications, see the API documentation.
3. Do the parsers add all words into the index?
After a document is initially parsed into terms, there may be other decisions to make before a term is added to the index: whether the word is important enough to add, whether to add the word as is or to index its stem instead, and whether to recognize certain words as acronyms. Lemur supports an acronym list, ignoring stopwords (very common words, like "the", "and", "it"), and indexing word stems (so "stem", "stemming", and "stems" would all become the same term). All of these features are available in the provided application, BuildIndex.
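The order of these decisions can be sketched as a small pipeline. The Python example below is only an illustration under stated assumptions: the stopword set comes from the examples in the text, the acronym list is hypothetical, and `toy_stem` is a deliberately naive suffix stripper, not one of the stemmers that ship with Lemur.

```python
STOPWORDS = {"the", "and", "it"}  # the example stopwords from the text
ACRONYMS = {"TREC", "NIST"}       # hypothetical acronym list


def toy_stem(term):
    """Naive suffix stripping, for illustration only."""
    if term.endswith("ing") and len(term) >= 6:
        stem = term[:-3]
        # collapse a doubled final letter, e.g. "stemm" -> "stem"
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if term.endswith("s") and not term.endswith("ss") and len(term) >= 4:
        return term[:-1]
    return term


def process(tokens):
    """Yield the terms that would be pushed into the index."""
    for token in tokens:
        if token in ACRONYMS:
            yield token  # keep recognized acronyms exactly as written
            continue
        term = token.lower()
        if term in STOPWORDS:
            continue  # stopwords are not indexed
        yield toy_stem(term)


terms = list(process("The stemming of stems and TREC documents".split()))
```

Here "stemming" and "stems" end up as the same term, the stopwords are dropped, and the acronym keeps its original case.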
4. What type of indexes does Lemur have?
Lemur currently has two available index types: KeyfileIncIndex and IndriIndex. The indexes differ in that they might index different data or represent the data differently on disk. Each index has a "table of contents" file, which holds some summary statistics on what is in the index as well as which files are needed to load the index. When you want to use an index, you need its table of contents file to load it. A Keyfile index uses the extension ".key" for its table of contents file. An IndriIndex is the exception: it has no single table of contents file to load; instead, the root directory of the Indri index is used as the input path to open the index.
Index Name | Extension | File Limit | Stores positions? | Loads fast? | Disk space usage | Applications | Add documents to Index* |
---|---|---|---|---|---|---|---|
KeyfileIncIndex | .key | no | yes | yes | Average | BuildIndex | yes, use BuildIndex |
IndriIndex | none (uses the index root directory) | no | yes | yes | Most (automatically stores a compressed version of the original documents) | BuildIndex or IndriBuildIndex | yes |
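Because the two index types are opened through different kinds of paths, a script that accepts either one may want to check which kind it was given. The Python helper below is a hypothetical sketch based only on the path shape described above (a ".key" file versus a directory); it does not open or validate the index.

```python
import os
import tempfile


def guess_index_type(path):
    """Guess the index type from the shape of the path alone.

    Heuristic only: a directory is not necessarily an Indri index.
    """
    if path.endswith(".key"):
        return "KeyfileIncIndex"  # table of contents file
    if os.path.isdir(path):
        return "IndriIndex"       # opened through its root directory
    return None                   # unknown path shape


# Example with a temporary directory standing in for an Indri index root.
with tempfile.TemporaryDirectory() as index_dir:
    dir_guess = guess_index_type(index_dir)
key_guess = guess_index_type("myindex.key")
```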