This application builds an index for a collection of documents.
To use it, follow the general steps for running a Lemur application.
The parameters are:
- index: name of the index table-of-content file, without the extension.
- indexType: the type of index, either "key" (KeyfileIncIndex) or "indri" (LemurIndriIndex).
- memory: size in bytes of the KeyfileIncIndex cache (default: 128000000).
- stopwords: name of the file containing the stopword list.
- acronyms: name of the file containing the acronym list.
- countStopWords: if true, count stopwords in document length.
- docFormat: the format of the documents, one of:
  - "trec" for standard TREC formatted documents
  - "web" for web TREC formatted documents
  - "chinese" for segmented Chinese text (TREC format, GB encoding)
  - "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  - "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer: the stemmer to apply, one of:
  - "porter": Porter stemmer
  - "krovetz": Krovetz stemmer
  - "arabic": Arabic stemmer; requires additional parameters:
    - arabicStemFunc: which stemming algorithm to apply, one of:
      - arabic_stop: stop word removal
      - arabic_norm2: table normalization
      - arabic_norm2_stop: table normalization with stopping
      - arabic_light10: light9 plus ll prefix
      - arabic_light10_stop: light10 and remove stop words
- dataFiles: name of the file containing the list of data files to index.
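As a rough illustration of how the parameters above fit together, the following Python sketch checks a set of values against the options listed and renders them as parameter-file text. Several things here are assumptions rather than facts from this page: the XML layout (a `parameters` root with one child element per parameter), the helper name `build_index_params`, the decision to treat `index` and `dataFiles` as required, and the example file names. Check the general Lemur application instructions for the exact file format your release expects.

```python
import xml.etree.ElementTree as ET

# Allowed values, copied from the parameter list above.
INDEX_TYPES = {"key", "indri"}
DOC_FORMATS = {"trec", "web", "chinese", "chinesechar", "arabic"}
STEMMERS = {"porter", "krovetz", "arabic"}
ARABIC_STEM_FUNCS = {
    "arabic_stop",
    "arabic_norm2",
    "arabic_norm2_stop",
    "arabic_light10",
    "arabic_light10_stop",
}


def build_index_params(params):
    """Validate a dict of BuildIndex parameters and return XML text.

    Hypothetical helper. Assumes the parameter file is XML with a
    <parameters> root and one child element per parameter.
    """

    def check(name, allowed):
        # Only validate a parameter if the caller supplied it.
        value = params.get(name)
        if value is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")

    # Assumption: an index name and a data file list are always needed.
    for required in ("index", "dataFiles"):
        if required not in params:
            raise ValueError(f"missing required parameter: {required}")

    check("indexType", INDEX_TYPES)
    check("docFormat", DOC_FORMATS)
    check("stemmer", STEMMERS)
    check("arabicStemFunc", ARABIC_STEM_FUNCS)

    # The Arabic stemmer needs its stemming algorithm named explicitly.
    if params.get("stemmer") == "arabic" and "arabicStemFunc" not in params:
        raise ValueError("stemmer 'arabic' requires arabicStemFunc")

    root = ET.Element("parameters")
    for name, value in params.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example values and file names are placeholders.
    print(build_index_params({
        "index": "myindex",
        "indexType": "key",
        "memory": 128000000,
        "stopwords": "stopwords.txt",
        "countStopWords": "false",
        "docFormat": "trec",
        "stemmer": "krovetz",
        "dataFiles": "datafiles.list",
    }))
```

The returned text would be saved to a file and supplied to the application as described in the general steps for running a Lemur application.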
Generated on Tue Jun 15 11:02:58 2010 for Lemur by Doxygen 1.3.4