Lemur Indexing Applications
1. BuildIndex
This application builds a KeyfileIncIndex or an IndriIndex for a collection of documents.
To use it, follow the general steps for running a Lemur application.
The parameters are:
- index: name of the index table-of-contents file, without the extension. Use the full path so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of the index you want to build
- key for KeyfileIncIndex (.key)
- indri for IndriIndex (.ind)
- memory: memory (in bytes) to pre-allocate (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list. Acronym lists are currently not supported by IndriIndex, which still indexes these acronyms in lowercase.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer; requires the additional parameters arabicStemDir and arabicStemFunc.
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
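The enumerated values above lend themselves to a quick sanity check before a long indexing run. The following is a minimal sketch, not part of Lemur: the helper name `check_build_index_params` is hypothetical, the allowed values are copied from the list above, and exact (case-sensitive) matching is an assumption.

```python
# Allowed values copied from the BuildIndex parameter list above.
# The checker is a hypothetical helper, not a Lemur API; case-sensitive
# matching of values is an assumption.
INDEX_TYPES = {"key", "indri"}
DOC_FORMATS = {"trec", "web", "chinese", "chinesechar", "arabic"}
STEMMERS = {"porter", "krovetz", "arabic"}
ARABIC_STEM_FUNCS = {
    "arabic_stop",
    "arabic_norm2",
    "arabic_norm2_stop",
    "arabic_light10",
    "arabic_light10_stop",
}


def check_build_index_params(params):
    """Return a list of problems found in a dict of BuildIndex parameters."""
    problems = []
    enumerated = (
        ("indexType", INDEX_TYPES),
        ("docFormat", DOC_FORMATS),
        ("stemmer", STEMMERS),
    )
    for name, allowed in enumerated:
        value = params.get(name)
        # Parameters that are absent are not checked here.
        if value is not None and value not in allowed:
            problems.append(f"{name}: unknown value {value!r}")
    # The arabic stemmer needs its two additional parameters.
    if params.get("stemmer") == "arabic":
        if not params.get("arabicStemDir"):
            problems.append("arabicStemDir is required by the arabic stemmer")
        func = params.get("arabicStemFunc")
        if func not in ARABIC_STEM_FUNCS:
            problems.append(f"arabicStemFunc: unknown or missing value {func!r}")
    return problems
```

An empty list means that none of the enumerated values conflict with the list above; the check says nothing about whether the referenced files exist.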
2. IndriBuildIndex
This application builds an Indri Repository for a collection of documents. The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or a file. The parameter file uses an XML format. The command line uses dotted path notation. The top-level element in the parameter file is named parameters.
Repository construction parameters
- memory
- an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
- corpus
- a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are
- path
- The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
- class
- The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
- html -- web page data.
- trecweb -- TREC web format, e.g. terabyte track.
- trectext -- TREC format, e.g. TREC-3 onward.
- trecalt -- TREC format, e.g. TREC-3 onward, with only the TEXT field included.
- doc -- Microsoft Word format (Windows platform only).
- ppt -- Microsoft PowerPoint format (Windows platform only).
- pdf -- Adobe PDF format.
- txt -- Plain text format.
- annotations
- The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line.
- metadata
- The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.
Combining the first two of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
- metadata
- a complex element containing one or more entries specifying the metadata fields to index, e.g. title, headline. There are three options:
- field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
- forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line. The external document id field "docno" is automatically added as a forward metadata field.
- backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line. The external document id field "docno" is automatically added as a backward metadata field.
- field
- a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:
- name
- the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
- numeric
- the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
- stemmer
- a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
- stopper
- a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
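The memory suffix rule and the correspondence between the XML parameter file and the dotted command-line form can be sketched in Python. This is an illustration under stated assumptions, not Indri code: the helper names `parse_memory`, `build_parameters`, and `to_command_line` are hypothetical, and only elements described above are emitted, so a real parameter file may need further elements not covered in this section.

```python
import re
import xml.etree.ElementTree as ET

# Scaling factors as listed for the memory parameter (case insensitive).
_SCALE = {"": 1, "K": 1000, "M": 1000000, "G": 1000000000}


def parse_memory(value):
    """Convert a value such as '100M' into a number of bytes."""
    match = re.fullmatch(r"([0-9]+)([KkMmGg]?)", value)
    if match is None:
        raise ValueError(f"not decimal digits plus optional K/M/G: {value!r}")
    digits, suffix = match.groups()
    return int(digits) * _SCALE[suffix.upper()]


def build_parameters(memory, corpus_path, corpus_class, stemmer=None):
    """Build a <parameters> tree from a few of the elements described above."""
    root = ET.Element("parameters")
    ET.SubElement(root, "memory").text = memory
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class
    if stemmer:
        ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer
    return root


def to_command_line(root):
    """Flatten each leaf element into a -dotted.path=value argument."""
    args = []

    def walk(element, path):
        children = list(element)
        if children:
            for child in children:
                walk(child, f"{path}.{child.tag}")
        else:
            args.append(f"-{path}={element.text}")

    # The top-level <parameters> element is not part of the dotted path.
    for child in root:
        walk(child, child.tag)
    return args
```

For example, `to_command_line(build_parameters("100M", "/path/to/file_or_directory", "trecweb"))` yields `-memory=100M`, `-corpus.path=/path/to/file_or_directory`, and `-corpus.class=trecweb`, matching the forms given in the list above. `ET.tostring(root, encoding="unicode")` gives the text for the parameter file.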
3. BuildDocMgr
BuildDocMgr builds a document manager. A DocumentManager is necessary for later retrieval of the original documents in an index. BuildDocMgr also builds an inverted index at the same time if an index name is provided.
Summary of parameters:
- manager: (required) name of the document manager (without extension)
- managerType: (required) type of the document manager, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- dataFiles: name of file containing the list of datafile names (one datafile name per line, use full paths)
- index: name of the index table-of-contents file, without any extension. Use the full path so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of index to create. Currently only "key" (KeyfileIncIndex) is supported.
- memory: memory (in bytes) to pre-allocate (def = 96000000).
- position: store position information (def = 1).
- stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer; requires the additional parameters arabicStemDir and arabicStemFunc.
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
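The dataFiles parameter above expects a file that lists one datafile name per line, with full paths. A minimal Python sketch for producing such a list follows; the helper name `write_data_files_list` and the glob pattern argument are illustrative assumptions, not part of Lemur.

```python
from pathlib import Path


def write_data_files_list(data_dir, list_path, pattern="*"):
    """Write one absolute datafile path per line; return the paths written.

    Hypothetical helper: it collects the regular files in data_dir that
    match pattern and writes them to list_path in sorted order.
    """
    paths = sorted(
        str(p.resolve()) for p in Path(data_dir).glob(pattern) if p.is_file()
    )
    Path(list_path).write_text(
        "".join(f"{p}\n" for p in paths), encoding="utf-8"
    )
    return paths
```

Keeping the list file outside `data_dir`, or using a narrower pattern, avoids the list file being picked up as a datafile on a later run.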
4. BuildPropIndex
This application builds an index for a collection of documents with properties associated with terms.
Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...
* data files can be specified on the command line OR in a metafile specified as the dataFiles parameter
The parameters are:
- index: name of the index to create (without extension)
- indexType: the type of index to create. Currently only "key" (KeyfileIncIndex) is supported.
- memory: memory (in bytes) to pre-allocate (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- "brill" for documents with Brill's part-of-speech tags. Documents must still be separated by DOC separators, as for Lemur's WebParser. This is the default.
- "identifinder" for documents with Identifinder's named entity tags. Documents must still be separated by DOC separators, as for Lemur's WebParser.
- stemmer:
- "porter" Porter stemmer.
- "krovetz" Krovetz stemmer.
- "arabic" Arabic stemmer; requires the additional parameters arabicStemDir and arabicStemFunc.
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.



