Lemur API User's Guide
Contents
- Introduction
- Some Useful Classes
- Utility level
- Indexer level
- Retrieval level
- References
1. Introduction
The Lemur Application Program Interface (API) is intended to allow a programmer to use the toolkit for special-purpose applications that are not implemented in the toolkit itself. The API provides interfaces to Lemur classes that are grouped at three different levels:
- Utility Level
This includes some common utilities such as memory management, a default exception handler, and a program argument handler. These utilities may be useful even if your program is not doing language modeling or text retrieval. There are also several classes that support flexible document parsing. The main class is TextHandler which allows for the chaining or pipelining of common parser components. See Parsing in Lemur for more details. The abstract class DocStream makes it easy to incorporate a user-defined document stream handler into the Lemur indexer, thus supporting different document formats. A basic document stream handler that recognizes a pre-defined document format is also included.
- Indexer Level
An indexer converts a raw text collection to efficient data structures, so that the information (e.g., word counts) may be accessed conveniently and efficiently later. This indexer level depends on the utility level mentioned above. It provides a "push" method where the application is responsible for sending the appropriate information to an indexer through the PushIndex interface. The indexer level also contains an abstract Index interface for the access of indexed information, two implementations of this interface, and various kinds of indexing support, such as the common data structures used by Index: TermInfo, TermInfoList, DocInfo, and DocInfoList. This level supports not just retrieval but also any type of application that requires efficient access to the frequency information of words in a large text collection.
- Retrieval Level
This level has abstract classes for a general retrieval architecture and concrete classes for several specific information retrieval methods. Classes at this level are most useful for users who want to build a prototype system or an evaluation system. They are also useful for users who want to extend these algorithms, e.g., by adding a different smoothing algorithm or a completely new retrieval algorithm. This level depends on the abstract Index interface (regardless of the underlying index implementation)
2. Some Useful Classes
This section describes the design of some of the most useful classes in the toolkit. A complete list of all classes and their documentation can be found in the Lemur source documentation
- Utility level
- Indexer level
- Retrieval level
2.1 Document streams (utility level)
The classes DocStream, Document, DocumentProps, and TokenTerm are a group of abstract classes that specify an abstract interface for streaming a text collection. By subclassing these classes, a user may define a customized document stream handler.
The interface is expected to be used in the following way. First, iteration can be performed on DocStream to obtain each instance of Document sequentially. However, the call of the method nextDoc is expected to be followed by another iteration call startTermIteration before the next call of nextDoc can be performed. This is because the method nextDoc does NOT advance the file pointer to the beginning of the next document; it is when we iterate over terms (i.e., getting instances of TokenTerm) that the file position pointer gets advanced. Note that you can call startTermIteration multiple times to repeat reading the terms in a document, and you can also call skipToEnd to "fast" forward to the end of the current document without reading all the terms, which then would allow you to call nextDoc to get the next document. An instance of TokenTerm provides the spelling of a term and possibly other information. The obtained document instance also supplies information about document ID and possibly other information in DocumentProps.
The Lemur toolkit has an implementation of DocStream called BasicDocStream. It supports the following simple format:
<DOC docid1> term1 term2 ... termN </DOC> <DOC docid2> ... </DOC>
BasicDocStream can be used to access a document stream for any purpose (not just indexing).
Note that as long as the interface of DocStream is respected, the indexer will not care about where the terms or documents are actually from. This means, we can "fake" a collection of documents as one single document (e.g., by pooling all terms together), and this would allow us to support indexing of collections for distributed IR.
2.2 Text handlers (utility level)
The abstract class TextHandler allows for the chaining or pipelining of common parser components. It provides a general and flexible support for preprocessing documents. This interface is extended to provide a Parser, Stemmer, and Stopper. See Parsing in Lemur for more details.
2.3 Building an index (Indexer Level)
The "push" method of building an index is to use the PushIndex interface to add documents and terms; thus, the indexing process is actually controlled by the application, which allows more than one index to be built at the same time. The application can use any kind of text handling, such as the ones provided in the utility level before passing the tokens to the PushIndex. There is currently only one implemented PushIndex class: KeyfileIncIndex.
Here is an example of how the PushIndex interface can be used to build one regular index and another index with one very large document at the same time:
KeyfileIncIndex *regIndex = new KeyfileIncIndex(indexname, memorysize, startDocId); KeyfileIncIndex *onedocIndex = new KeyfileIncIndex(indexname, memorysize, secondStartDocId); ... onedocIndex->beginDoc(documentprops); while (moreTokens()) { switch (token) { case DOCID: // close out the previous if (!firstDocument()) { setEndDocumentProps(documentprops) regIndex->endDoc(documentprops); } // start a new document setNewDocumentProps(documentprops) regIndex->beginDoc(documentprops); break; case TERM: setTermInformation(invfpterm); regIndex->addTerm(invfpterm); onedocIndex->addTerm(invfpterm); break; } regIndex->endCollection(collectionprops); onedocIndex->endDoc(documentprops); onedocIndex->endCollection(collectionprops);
2.4 Access to index information (Indexer Level)
The most important Lemur abstract classes at the indexer level are Index, TermInfoList, DocInfoList, TermInfo, and DocInfo. These classes provide a uniform interface for access to an indexed document collection. They are usually the only classes that an application program needs to deal with when accessing an index. The functions supported by Index can be grouped in the following way:
- Open.
An indexer must be opened before any access function can work. The open function takes a single argument: the name of the table of contents (toc) file that is created by an indexer.
- Spelling and integer-encoding translations.
There are two lexicons maintained by any indexer. One is for terms and the other is for document identifiers. Each term has both a spelling form (i.e., a string) and an integer encoding (i.e., an index). Each document is also assumed to have a unique string identifier (ID), so that each document ID also has a spelling form and an integer encoding. The Index class has several functions that you can use to convert back and forth between the string form and the integer encoding. An unknown document ID or term is mapped to the zero integer encoding, which is allocated for any ID that is "out of vocabulary" (OOV). Thus, the effective encoding or index for terms or document IDs starts from 1 up to the total count of terms or document IDs in a collection.
- Summary counts.
There are various kinds of summary counts supported. The count of terms belong to two types: regular counts and unique counts. The function name indicates this difference (i.e., termCount and termCountUnique). The regular counts include every occurance of the token, while the unique counts ignores repeated occurances of the same token. The term count functions are overloaded functions with the parameter indicating the scope of counting. If an document ID is passed as the parameter, then the count of terms is restricted to the document with that ID. If a term ID is passed as the parameter, then the count of terms means the count of the passed term in the whole collection. Without any parameter, the count of terms will be on any term and on the whole collection. The counts supported include:
- Total number of documents (method docCount())
- Total number of different/unique terms in the collection, i.e., the vocabulary size (method termCountUnique())
- Total number of terms in a document (i.e., document length) (method docLength(int docID))
- Total number of documents with a particular term (i.e., document frequency , e.g., used in the IDF formula) (method docCount(int termID))
- Total count of occurrences of a term in the collection (method termCount(int termID))
- Total count of all terms in the whole collection (method termCount())
- Average number of terms in a document (i.e., average document length) (method docLengthAvg())
- Index access.
The basic information that an indexer stores is a document by term matrix with each entry denoting the frequency count of a term in a document. There are two ways to access this matrix: by document or by term. That is, with document ID as the key, you can obtain a list of counts (called a TermInfoList in Lemur), each corresponding to a term occurring in the document. Alternatively, you can use a term ID as the key to obtain a list of counts (called a DocInfoList in Lemur), each corresponding to the count of the term in each document in which it occurs. An index that supports the second way is often referred to as an inverted index.
These two ways of access are supported by two functions on Index: termInfoList and docInfoList. Function termInfoList returns an instance of TermInfoList which supports iterations over all the counts along with their corresponding term ID. Similarly, docInfoList returns an instance of DocInfoList, which supports iterations over all the counts along with the corresponding document ID. The instances returned by functions termInfoList and docInfoList must be deleted later. However, when iterating over entries of either a DocInfoList or a TermInfoList, you will get a pointer to a static instance in the indexer, so there is no need to, and you must not, delete the instance for each entry. (See the example below.)
- Original document source.
The Index method docManager(int docID) returns the document manager object that is associated with this docID. The DocumentManager API should then be used to obtain and parse the actual document. One example implementation of a DocumentManager is FlattextDocMgr. The docManager(int docID) method may not be implemented for all indexes, in which case it returns an invalid pointer.
BuildIndex builds a KeyfileIncIndex or a LemurIndriIndex . Usually, an application programmer would only need to construct an instance of the indexer that is actually used, and then will access information through the abstract interface Index. The class IndexManager has an openIndex method that can create the proper index using its file suffix.
The following is an example of how to use the index interface:
Index *ind; ind = IndexManager::openIndex("index-toc-file.key"); // open the index specified by the table-of-content (toc) file "index-toc-file.key" // IndexManager recognizes the suffix of the toc file (in this case ".key") and uses it to // infer the actual type of the index to be opened (KeyfileIncIndex). int t1; ... // now fetch the doc info list for term t1 DocInfoList *dList = ind->docInfoList(t1); dList->startIteration(); while (dList->hasMore()) { DocInfo * entry = dList->nextEntry(); // obtains a pointer to a static instance cout << "entry doc id: " << entry->docID() << endl; cout << "entry term count: " << entry->termCount() << endl; // note that you MUST NOT delete entry! } delete dList; // you MUST delete dList! // TermInfoList can be accessed in exactly the same way delete ind; // The IndexManager creates a dynamic instance.
2.5 Counts and language models (Indexer Level)
The classes UnigramLM, Counter, DocUnigramCounter, and SmoothedMLEstimator provide support for building language models on top of an Index
The class UnigramLM is an abstract representation of a unigram language model. It supports looking up a probability for a given word and iterating over all the non-zero probabilities. Since each word is referred to by its integer index, a UnigramLM is also required to hold a lexicon ID that provides an interpretation of a word index. Typically, this is the file name of the term lexicon associated with an Index. This id is unused.
The class Counter is an abstract representation of a counter that provides an interface for access to the counts it holds. The concept of "count" is fairly general, and could be an integer, real, or user-defined value. Such generality can be useful for algorithms such as Expectation Maximization (EM), where the counts are fractional. The only implemented counter is based on an array (ArrayCounter). Since an array must be held in the memory, such a counter has a limit of how many entries can be stored, as well as a limit of how many such counters can be active at the same time. (This limitation can be broken by implementing a disk-based counter.) To facilitate language model construction based on an Index, we implemented a DocUnigramCounter which is a subclass of ArrayCounter. A DocUnigramCounter can be constructed in three different ways, corresponding to counting the words in a single document, in a set of documents, or in the entire collection.
The class SmoothedMLEstimator is a special UnigramLM that is estimated based on smoothing the maximum likelihood estimate. This is very common when we want to estimate a language model based on a piece of observed text (e.g., a document, or a set of documents). From this viewpoint, perhaps, an alternative (better?) name for this class would be TextUnigramLM. SmoothedMLEstimator holds a counter, which is assumed to hold the counts needed by the maximum likelihood estimator. The reason for the existence of this class is that several implementations use a smoothed maximum likelihood estimator based on a counter, and require iteration over non-zero probability entries. Without SmoothedMLEstimator, the implementation of such iterations would have to be duplicated in each of the specific smoothing methods. In addition to the maximum likelihood estimator, three specific smoothing methods are implemented: fixed coefficient linear smoothing, a Bayesian method using a Dirichlet prior, and the popular Laplace smoothing which is a special case of the Dirichlet method with a uniform prior.
The separation of the counter from the language model interface makes it easy to change the storage-level representation of counts without affecting the smoothing algorithm. For example, the Dirichlet prior smoothing method can take any concrete counter, as long as it conforms to the interface defined by the abstract class Counter. The general way of constructing a unigram language model is to first construct a counter and then use the counter to construct a (possibly smoothed) language model. For example, to construct a non-smoothed document language model, one could construct a DocUnigramCounter based on a document ID and an Index, and then construct a maximum likelihood unigram language model (MLUnigramLM) using the DocUnigramCounter as the counter.
2.6 General Retrieval Architecture (Retrieval Level)
Classes Query, QueryRep, and RetrievalMethod provide a very general, yet flexible support for different types of retrieval methods. Query represents any query, including a structured query. QueryRep represents any (internal) query representation determined by a specific retrieval method. It may also be used to hold any document-independent, query-specific values precomputed before scoring each document to make scoring process more efficient. The major abstract class for retrieval is RetrievalMethod which is on top of an Index and supports four methods:
- computeQueryRep returns a QueryRep for a Query.
- scoreDoc computes a score for one document with respect to the given QueryRep.
- scoreCollection scores all the documents in the collection represented by the Index.
- updateQuery modifies a QueryRep based on a set of document IDs, and serves as a general support for feedback.
It is reasonable to require any retrieval method to support a method for scoring one single document. Given that we have scoreDoc, it may seem redundant to have scoreCollection. However, scoring a collection based on scoreDoc would mean to score each document one by one, which is inefficient. Indeed, the scoring of the whole collection can often be much more efficient for many practical retrieval methods, if done on the inverted index. The existence of scoreCollection makes it possible for those retrieval methods to override it with a more efficient scoring strategy. For example, TextQueryRetMethod implements a generic scoring procedure based on the inverted index.
Now, one might also wonder why we also need scoreDoc now that we have scoreCollection. scoreDoc is needed for two main reasons. First, it may be impractical to score the whole collection with some complex retrieval algorithms, or you may only be interested in re-ranking some subset of documents. In that case, scoreCollection would be inappropriate, and being able to score a single document will be necessary. Second, sometimes, it may be desirable to compute the similarity between two documents using a retrieval function (e.g., treating one document as a query), and in that case, scoreDoc is also useful.
The support of feedback through modifying a QueryRep has three advantages:
- There is no need to distinguish pseudo feedback from relevance feedback; indeed the only difference is where the examples of relevant documents come from.
- It can potentially support many different types of non-traditional feedback. For example, the Markov chain query model has been implemented as a "feedback" procedure. (See Specific retrieval methods for text queries 2.8 .)
- It allows the possibility of reusing the QueryRep computed from the initial retrieval stage, so makes pseudo feedback efficient.
2.7 General retrieval methods for text queries (Retrieval Level)
In most cases, a query can be represented by a set of weighted terms, or a term vector. This is often true when the query is a piece of text (i.e., a text query). But, it may not be true for a Boolean query. When a query can be represented as a set of weighted terms, the retrieval function can often be computed efficiently via an inverted index. The abstract class TextQueryRetMethod is designed to support such retrieval functions. It inherits from RetrievalMethod and implements a generic inverted index scoring procedure, which involves iterating over all the query terms, for each query term, looking up all the entries in the inverted index to find the information about all the documents containing this term, and then, computing a sum of weights of each matched term for all the documents that match at least one query term. A ScoreAccumulator serves for accumulating the score for a document incrementally. A TextQueryRetMethod may use any ScoreAccumulator that is passed in through the constructor. This provides the flexibility of using different kinds of accumulators. For example, if you can afford to store an entry for each document in memory, then using the ArrayAccumulator may be appropriate. But, if you have a huge collection, you may want to use a ScoreAccumulator that sits on the disk.
A subclass of TextQueryRetMethod overrides appropriate functions to provide a specific term weighting strategy, a specific function for combining the weights, and any needed term-independent transformation or adjustment of the score.
Formally, given a query q =(q1,q2,...,qN) and a document d=(d1,d2,...,dN), where q1,...,qN and d1,...,dN are terms, TextQueryRetMethod assumes the following general scoring function:
That is, the score of a document d against a query q is a function g of the accumulated weight w for each matched term. The score is thus determined by two functions g and w; both may depend on the whole query or document. The function w gives the weight of each matched term, while the function g makes it possible to perform any further transformation of the sum of the weight of all matched terms based on the "summary" information of a query or a document (e.g., document length).s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d)
Three abstract classes TextQueryRep, DocumentRep, and ScoreFunction are designed to support this general scoring function in the following way:
- A ScoreFunction is responsible for defining the two functions g and w.
- A TextQueryRep provides any information required for scoring from the query side (e.g., query term frequency). It also supports iteration over all query terms, allowing easy accumulation of weights over matched terms.
- A DocumentRep provides any information required for scoring from the document side.
The weighting function w and score adjustment function g typically assume and depend on some particular information and representation of the query and document, so a specific ScoreFunction (for a specific retrieval method) only works for some specific TextQueryRep and DocumentRep that are appropriate for the specific retrieval method. Such coordination is implemented by grouping compatible TextQueryRep, DocumentRep, and ScoreFunction methods together underneath a TextQueryRetMethod class. Different document/query representations are possible by subclassing a DocumentRep or TextQueryRep. Similarly, alternative scoring functions can be achieved by subclassing ScoreFunction.
The generic scoring procedure is implemented by the function scoreInvertedIndex in TextQueryRetMethod, and the procedure is sketched as follows:
- Initialize all entries in the document score accumulator
- Iterate over all terms in the TextQueryRep, and for each term, lookup and iterate over all the entries for the query term in the inverted index.
- For each entry, which represents information about a document that matches the query term
- Compute a DocumentRep for this document
- Compute the weight for this matched query term using the w function provided by the ScoreFunction, which is "helped" by the current TextQueryRep and DocumentRep.
- Add this weight to the score accumulator entry for the document corresponding to this inverted index entry.
- After finishing score accumulation over all query terms, adjust the score for each document by calling the g function provided by the ScoreFunction.
Normally, the documents that did not match any query term are ignored from scoring, but these documents do not necessarily have a zero score. It depends on the retrieval method, they may have a non-zero score. Although the scores of such documents are generally lower than any of the document that matches at least one query term, it is sometimes useful to have the actual scores of those documents. For this purpose, the function scoreInvertedIndex has an option to allow you to score "all" the documents, including the one that has not matched any query term.
The general scoring function given above looks more restrictive than it is. It may not be obvious, but many retrieval functions can be re-written in this form after some algebra transformation. A typical transformation is to convert the sum over the query terms that have not been matched by a document into the sum over all the query terms minus the sum over all the matched ones. A similar transformation can be done for a sum over the document terms that have not been matched by a query. As a result, the scoring formula often just involves the sum over all the matched terms and some other query-dependent or document-dependent constants.
TextQueryRetMethod also implements scoreDoc by iterating over query terms and checking each query term against a hash table created for the document to be scored.
2.8 Specific retrieval methods for text queries (Retrieval Level)
A specific inverted-index scorable retrieval method can be implemented by sub-classing TextQueryRetMethod. Since TextQueryRetMethod already implemented scoreDoc and scoreCollection, a specific retrieval method only needs to override the following four virtual functions and define its "own version" of TextQueryRep, DocumentRep, and ScoreFunction by subclassing these three abstract classes.
- computeTextQueryRep Given a TermQuery (a text query), generates a TextQueryRep instance.
- computeDocRep Given a document id, generates a DocumentRep instance.
- scoreFunc Returns a handle of the specific ScoreFunction
- updateQuery Given a subset of (feedback) documents and an original TextQueryRep, computes an updated TextQueryRep.
The instance returned by the computeTextQueryRep implemented in a specific retrieval method will actually be a specific TextQueryRep that depends on the specific TextQueryRetMethod class. The same is true for computeDocRep. For example, in the KL-divergence retrieval method, the actual DocumentRep is SimpleKLDocModel, which is, of course, a subclass of DocumentRep. In order to support different document model smoothing methods, we created different subclasses of SimpleKLDocModel (i.e., JelinekMercerDocModel, DirichletPriorDocModel, and AbosluteDiscoutDocModel).
In Version 1.1, three retrieval methods were implemented: TFIDFRetMethod, OkapiRetMethod, and SimpleKLRetMethod Each of these classes, along with its corresponding TextQueryRep, DocumentRep, and ScoreFunction, implements a different retrieval method. They are all subclasses of the abstract class TextQueryRetMethod. These classes are to be used in the following way:
- Construct an instance of it by passing in an Index, a ScoreAccumulator, and any other additional information required by the retrieval method (e.g., any pre-computed resources).
- Set the parameters appropriately. Each of these concrete retrieval method classes maintains its own parameters, of its own parameter type. Each also has a parameter-setting function that allows a user to set the parameters. If the parameters are not set, the default parameter values will be used.
- Call the appropriate scoring function (scoreDoc or
) on the constructed instance.
The feedback method implemented in TFIDFRetMethod is a modified and simplified Rocchio. It basically constructs the centroid vector of the set of relevant documents (or the top-N documents in the case of pseudo feedback), and adds the vector to the original query vector with some weighting coefficient. It does not use any non-relevant (negative) documents. See details of the implementation of the Lemur TFIDF for more information.
OkapiRetMethod is intended to be an exact implementation of the BM25 retrieval function as described in [1]. However, the feedback part as implemented appears to have some problems, since it does not give reasonable feedback performance. The BM25 retrieval formula has a query TF for each term. In case the term is a new term extracted from the feedback documents, it does not have a natural "query TF". It is unclear from the original paper how this is set, so we implemented it as an additional parameter.
SimpleKLRetMethod implements the KL-divergence retrieval model, which is an extension of the query-likelihood approach [4]. It essentially scores a document by computing the KL-divergence between the query language model and the document language model. Since we are only interested in ranking documents, we rewrote the KL-divergence formula and only computes the part that affects ranking, which is essentially the cross entropy of the query model with respect to the document model.
2.9 Document and query models (Retrieval Level)
The two classes SimpleKLDocModel and SimpleKLQueryModel are the abstract interface for the document language model and query language model to be used in a KL-divergence retrieval model. This means that to add a new query or document model estimation method, one generally creates a subclass of these abstract classes or provides a way to construct an instance of these classes. For example, JelinekMercerDocModel, DirichletPriorDocModel, and AbosluteDiscoutDocModel are three subclasses of SimpleKLDocModel, corresponding to three different document model smoothing methods evaluated in [2]. (The backoff version is also implemented.)
The SimpleKLDocModel class assumes the following general smoothing strategy:
p(w|d) = pseen(w|d) if w is seen in document dwhere pseen(w|d) gives the discounted maximum likelihood estimate of p(w|d), p(w|C) is the collection language model, and a(d) is a document-dependent constant that controls the amount of smoothing. Given a collection language model (computed based on an Index), a smoothing method is essentially characterized by pseenand the backoff weight a(d). Accordingly, SimpleKLDocModel defines an interface for them.
p(w|d) = a(d) p(w|C) otherwise
SimpleKLQueryModel supports iteration over non-zero probabilities of a unigram query language model for scoring. It also supports query model updating through a method that allows linear interpolation of the existing query model with any given unigram language model (see below). Thus, it is possible to estimate a (feedback) unigram language model in any way you like and then evaluate its retrieval effectiveness by using this interpolation function to update an existing query model. A special case is to interpolate with a zero weight for the original model, which will then be equivalent to using the estimated feedback model alone.
2.10 Feedback as Query Language Model Updating (Retrieval Level)
With the KL-divergence retrieval formula, it is possible to view feedback as re-estimating or updating a query language model. SimpleKLRetMethod implemented three different methods for estimating a query model based on a subset of (feedback) documents, all by interpolating the newly estimated model with the original query model with some empirically set coefficient. The three feedback methods are the collection mixture model method, the divergence minimization method, and the Markov chain translation model. The first two are proposed and evaluated in [3] and the third in [4]. The three methods are implemented as three functions in SimpleKLRetMethod. A special parameter controls which method to use when the function updateQuery is called. A new feedback method can be easily added by subclassing a RetrievalMethod and overrides the method updateQuery with the new method.
4. References
1. Robertson, S. et al. Okapi at TREC-3. Proceedings of the 3rd Text Retrieval Conference, 1994, pp 109-126.
2. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval, In 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), 2001.
3. C. Zhai and J. Lafferty, Model-based feedback in the language modeling approach to information retrieval, Tenth International Conference on Information and Knowledge Management (CIKM 2001), 2001.
4. J. Lafferty and C. Zhai. Risk minimization and language modeling in information retrieval. In 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), 2001.