Lemur Indexing Applications
1. BuildIndex
This application builds a KeyfileIncIndex or an IndriIndex for a collection of documents.
To use it, follow the general steps for running a Lemur application.
The parameters are:
- index: name of the index table-of-contents file without the extension. Use the full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of the index you want to build
- key for KeyfileIncIndex (.key)
- indri for IndriIndex (.ind)
- memory: memory (in bytes) to pre-allocate (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list. Acronym lists are currently not supported by IndriIndex; these acronyms will still be indexed in lowercase by IndriIndex.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
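As an illustration, the parameters above might be combined into a parameter file along the following lines. This is only a sketch: it assumes the XML parameter file layout with a top-level parameters element that this page describes for the Indri applications, and it assumes each parameter maps to a flat element of the same name. All paths are hypothetical placeholders.

```xml
<parameters>
  <!-- hypothetical index location: full path, no extension -->
  <index>/lemur/indexes/myindex</index>
  <!-- key = KeyfileIncIndex (.key), indri = IndriIndex (.ind) -->
  <indexType>key</indexType>
  <!-- bytes to pre-allocate; this is the documented default -->
  <memory>96000000</memory>
  <!-- hypothetical stopword list file -->
  <stopwords>/lemur/data/stopwords.txt</stopwords>
  <!-- count stopwords in document length -->
  <countStopWords>true</countStopWords>
  <!-- standard TREC formatted documents -->
  <docFormat>trec</docFormat>
  <stemmer>porter</stemmer>
  <!-- hypothetical file listing the data files to index -->
  <dataFiles>/lemur/data/filelist.txt</dataFiles>
</parameters>
```

The comments are for the reader only; strip them if in doubt about how the parameter parser treats XML comments.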
2. IndriBuildIndex
This application builds an Indri Repository for a collection of documents. The Indri applications, IndriBuildIndex, IndriDaemon, and IndriRunQuery, accept parameters from either the command line or from a file. The parameter file uses an XML format. The command line uses dotted path notation. The top level element in the parameters file is named parameters.
Repository construction parameters
- memory
- an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid suffixes are (case insensitive) K = 1000, M = 1000000, G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
- corpus
- a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are
- path
- The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
- class
- The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
- html -- web page data.
- trecweb -- TREC web format, eg terabyte track.
- trectext -- TREC format, eg TREC-3 onward.
- trecalt -- TREC format, eg TREC-3 onward, with only the TEXT field included.
- doc -- Microsoft Word format (Windows platform only).
- ppt -- Microsoft PowerPoint format (Windows platform only).
- pdf -- Adobe PDF format.
- txt -- Plain text format.
- annotations
- The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line.
- metadata
- The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.
Combining the first two of these elements, the parameter file would contain:
<corpus>
<path>/path/to/file_or_directory</path>
<class>trecweb</class>
</corpus>
- metadata
- a complex element containing one or more entries specifying the metadata fields to index, eg title, headline. There are three options:
- field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
- forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line. The external document id field "docno" is automatically added as a forward metadata field.
- backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line. The external document id field "docno" is automatically added as a backward metadata field.
- field
- a complex element specifying the fields to index as data, eg TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:
- name
- the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
- numeric
- the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
- stemmer
- a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
- stopper
- a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping. Indri's standard stopword list is available in the IndriBuildIndex parameter file format.
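Putting the repository construction parameters together, a complete parameter file might look roughly like the sketch below. The field names (title, date), the three stopwords, and the paths are hypothetical and chosen only for illustration. The index element naming the repository location is not described in the list above and is included here as an assumption.

```xml
<parameters>
  <!-- assumption: repository location; not described in the list above -->
  <index>/path/to/repository</index>
  <!-- 100M = 100000000 bytes -->
  <memory>100M</memory>
  <corpus>
    <path>/path/to/file_or_directory</path>
    <class>trecweb</class>
  </corpus>
  <metadata>
    <!-- hypothetical field; docno is added automatically -->
    <forward>title</forward>
  </metadata>
  <!-- field may appear several times in a parameter file -->
  <field>
    <name>title</name>
  </field>
  <field>
    <!-- hypothetical numeric field -->
    <name>date</name>
    <numeric>true</numeric>
  </field>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
  <stopper>
    <!-- illustrative words only, not Indri's standard list -->
    <word>a</word>
    <word>an</word>
    <word>the</word>
  </stopper>
</parameters>
```

Because only the first field given on the command line is indexed, a parameter file like this is the way to index more than one field. Single-valued settings such as memory or stemmer.name can still be given in dotted form, e.g. -memory=100M -stemmer.name=krovetz.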
3. BuildDocMgr
BuildDocMgr builds a document manager. A DocumentManager is necessary for later retrieval of the original documents in an index. BuildDocMgr also builds an inverted index simultaneously if an index name is provided.
Summary of parameters (required ones are marked):
- manager: required; name of the document manager (without extension)
- managerType: required; name of the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- dataFiles: name of file containing the list of datafile names (one datafile name per line, use full paths)
- index: name of the index table-of-contents file without any extension. Use the full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of index to create. Currently only "key" (KeyfileIncIndex) is supported.
- memory: memory (in bytes) to pre-allocate (def = 96000000).
- position: store position information (def = 1).
- stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
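For illustration, a BuildDocMgr parameter file for an Arabic collection might be sketched as follows. It assumes the XML layout with a top-level parameters element described for the Indri applications, and it assumes that arabicStemDir and arabicStemFunc sit at the same level as stemmer. All paths are hypothetical placeholders.

```xml
<parameters>
  <!-- required: hypothetical document manager name, no extension -->
  <manager>/lemur/indexes/mydocmgr</manager>
  <!-- required: flat, bdm, or elem -->
  <managerType>flat</managerType>
  <!-- Arabic text, TREC format, Windows CP1256 encoding -->
  <docFormat>arabic</docFormat>
  <!-- hypothetical file, one full datafile path per line -->
  <dataFiles>/lemur/data/filelist.txt</dataFiles>
  <!-- giving an index name also builds an inverted index -->
  <index>/lemur/indexes/myindex</index>
  <indexType>key</indexType>
  <!-- store position information (the documented default) -->
  <position>1</position>
  <stemmer>arabic</stemmer>
  <!-- assumption: these two are top-level elements -->
  <arabicStemDir>/lemur/data/arabic_stem</arabicStemDir>
  <arabicStemFunc>arabic_light10</arabicStemFunc>
</parameters>
```

Leaving out index and indexType would build only the document manager.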
4. BuildPropIndex
This application builds an index for a collection of documents with properties associated with terms.
Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...
* data files can be specified on the command line OR in a metafile specified as the dataFiles parameter
The parameters are:
- index: name of the index to create (don't include extension)
- indexType: the type of index to create. Currently only "key" (KeyfileIncIndex) is supported.
- memory: memory (in bytes) to pre-allocate (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- "brill" for documents with Brill's part of speech tags, still needs DOC separators between documents similar to Lemur's WebParser. This is the default.
- "identifinder" for documents with Identifinder's named entity tags, still needs DOC separators between documents similar to Lemur's WebParser.
- stemmer:
- "porter" Porter stemmer.
- "krovetz" Krovetz stemmer.
- "arabic" Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
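A parameter file for BuildPropIndex could be sketched in the same way. As before, the XML layout with a top-level parameters element and flat element names is an assumption, and the paths are hypothetical placeholders.

```xml
<parameters>
  <!-- hypothetical index name, no extension -->
  <index>/lemur/indexes/mypropindex</index>
  <!-- only key (KeyfileIncIndex) is supported -->
  <indexType>key</indexType>
  <memory>96000000</memory>
  <!-- documents carrying Brill part of speech tags (the default) -->
  <docFormat>brill</docFormat>
  <stemmer>krovetz</stemmer>
  <!-- hypothetical metafile listing the data files; data files
       may instead be given on the command line -->
  <dataFiles>/lemur/data/tagged_filelist.txt</dataFiles>
</parameters>
```

Following the usage line above, the paramfile argument would name this file; the dataFiles element can be dropped when the data files are listed on the command line instead.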