Lemur:


This application builds an index for a collection of documents.

To use it, follow the general steps for running a Lemur application.

The parameters are:

index: name of the index table-of-contents file, without the extension.
indexType: the type of index, either key (KeyfileIncIndex) or indri (LemurIndriIndex).
memory: memory (in bytes) of the KeyfileIncIndex cache (default: 128000000).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
docFormat:
- "trec" for standard TREC formatted documents
- "web" for web TREC formatted documents
- "chinese" for segmented Chinese text (TREC format, GB encoding)
- "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
- "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
stemmer:
- "porter" Porter stemmer.
- "krovetz" Krovetz stemmer.
- "arabic" Arabic stemmer; requires an additional parameter:
  1. arabicStemFunc: which stemming algorithm to apply, one of:
    - arabic_stop : stop word removal
    - arabic_norm2 : table normalization
    - arabic_norm2_stop : table normalization with stopping
    - arabic_light10 : light9 plus ll prefix
    - arabic_light10_stop : light10 and remove stop words
dataFiles: name of the file containing the list of data files to index.
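
As a rough illustration of how the parameters above fit together, the following Python sketch checks a set of values against the options listed on this page and serializes them. It assumes that the parameter file is XML with a <parameters> root and one child element per parameter; that layout is not stated on this page and should be checked against the general Lemur application instructions. The helper build_parameters, the index name "myindex", and the file name "datafiles.txt" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Values documented on this page for each enumerated parameter.
ALLOWED = {
    "indexType": {"key", "indri"},
    "docFormat": {"trec", "web", "chinese", "chinesechar", "arabic"},
    "stemmer": {"porter", "krovetz", "arabic"},
    "arabicStemFunc": {
        "arabic_stop", "arabic_norm2", "arabic_norm2_stop",
        "arabic_light10", "arabic_light10_stop",
    },
}


def build_parameters(params):
    """Validate a dict of parameters and return parameter-file text.

    Hypothetical helper, not part of Lemur.
    """
    for name, allowed in ALLOWED.items():
        value = params.get(name)
        if value is not None and value not in allowed:
            raise ValueError(f"{name}: unexpected value {value!r}")
    # The Arabic stemmer requires arabicStemFunc, as documented above.
    if params.get("stemmer") == "arabic" and "arabicStemFunc" not in params:
        raise ValueError("stemmer 'arabic' requires arabicStemFunc")

    # ASSUMPTION: one child element per parameter under a <parameters> root.
    root = ET.Element("parameters")
    for name, value in params.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


example = build_parameters({
    "index": "myindex",            # hypothetical index name, no extension
    "indexType": "key",
    "memory": 128000000,           # documented default, in bytes
    "docFormat": "trec",
    "stemmer": "krovetz",
    "dataFiles": "datafiles.txt",  # hypothetical file listing the data files
})
print(example)
```

The sketch treats every parameter as optional because this page does not say which ones are required; only the documented dependency between the "arabic" stemmer and arabicStemFunc is enforced.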

Generated on Tue Jun 15 11:02:58 2010 for Lemur by doxygen 1.3.4