Indri Parameter Files

The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or a file. The parameter file uses an XML format whose top-level element is named parameters. The command line uses dotted path notation, in which each dot descends one level into the XML structure (for example, -corpus.path corresponds to <corpus><path>).
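The correspondence between the two notations can be sketched in Python. This is only an illustration of the mapping, not code from Indri: the function name args_to_parameters is hypothetical, and merging repeated prefixes (such as two -corpus.* arguments) into a single element is a simplifying assumption of the sketch.

```python
import xml.etree.ElementTree as ET

def args_to_parameters(args):
    """Turn dotted command-line arguments into a <parameters> XML tree.

    Hypothetical helper: arguments sharing a prefix (e.g. -corpus.path and
    -corpus.class) are merged into one element, which is an assumption.
    """
    root = ET.Element("parameters")
    for arg in args:
        # Split "-corpus.path=/data" into the dotted name and the value.
        dotted, _, value = arg.lstrip("-").partition("=")
        parts = dotted.split(".")
        node = root
        for name in parts[:-1]:
            child = node.find(name)
            if child is None:
                child = ET.SubElement(node, name)
            node = child
        ET.SubElement(node, parts[-1]).text = value
    return root

xml_text = ET.tostring(
    args_to_parameters(
        ["-memory=100M", "-corpus.path=/path/to/docs", "-corpus.class=trecweb"]
    ),
    encoding="unicode",
)
print(xml_text)
```

The printed string is a single-line parameter file containing a memory element followed by one corpus element with path and class subelements.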

Repository construction parameters

memory

an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid values are (case insensitive) K = 1000, M = 1000000, G = 1000000000. So 100M would be equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
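As a rough illustration of the suffix rule, the following Python sketch converts a memory value to a byte count. The function name parse_memory is hypothetical; it mirrors the rule as described here (decimal digits plus an optional case-insensitive K, M, or G) and is not the parser Indri itself uses.

```python
import re

# Decimal scaling factors as listed in this section.
_SCALE = {"": 1, "K": 1000, "M": 1000000, "G": 1000000000}

def parse_memory(value):
    """Convert a value such as '100M' into a number of bytes."""
    match = re.fullmatch(r"([0-9]+)([KMG]?)", value.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError("expected decimal digits and an optional K, M or G suffix")
    digits, suffix = match.groups()
    return int(digits) * _SCALE[suffix.upper()]

print(parse_memory("100M"))  # 100000000
```

A value such as 1.5G is rejected by this sketch, since the value should contain only decimal digits and the optional suffix.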

corpus

a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are

path

The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.

class

The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:

html -- web page data.
trecweb -- TREC web format, eg terabyte track.
trectext -- TREC format, eg TREC-3 onward.
trecalt -- TREC format, eg TREC-3 onward, with only the TEXT field included.
warc -- WARC (Web ARChive) format, such as is output by the Nutch webcrawler.
warcchar -- WARC (Web ARChive) format, such as is output by the Nutch webcrawler. Tokenizes individual characters, enabling indexing of unsegmented text.
doc -- Microsoft Word format (windows platform only).
ppt -- Microsoft Powerpoint format (windows platform only).
pdf -- Adobe PDF format.
txt -- Plain text format.

annotations

The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line.

metadata

The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.

Combining the first two of these elements, the parameter file would contain:
<corpus>
<path>/path/to/file_or_directory</path>
<class>trecweb</class>
</corpus>
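Because the corpus element can be specified multiple times, a parameter file for several collections can be generated programmatically. The Python sketch below is one way to do that with the standard library. The helper name add_corpus and the example paths are hypothetical, and treating annotations and metadata as optional is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

def add_corpus(parameters, path, file_class, annotations=None, metadata=None):
    """Append one <corpus> element to the <parameters> tree."""
    corpus = ET.SubElement(parameters, "corpus")
    ET.SubElement(corpus, "path").text = path
    ET.SubElement(corpus, "class").text = file_class
    # Optional subelements are written only when a value is supplied.
    for tag, value in (("annotations", annotations), ("metadata", metadata)):
        if value is not None:
            ET.SubElement(corpus, tag).text = value
    return corpus

parameters = ET.Element("parameters")
add_corpus(parameters, "/path/to/web_docs", "trecweb")
add_corpus(parameters, "/path/to/news_docs", "trectext", annotations="/path/to/file")
corpus_xml = ET.tostring(parameters, encoding="unicode")
print(corpus_xml)
```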

metadata

a complex element containing one or more entries specifying the metadata fields to index, e.g. title, headline. There are three options:

field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line. The external document id field "docno" is automatically added as a forward metadata field.
backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line. The external document id field "docno" is automatically added as a backward metadata field.
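The difference between the forward and backward options can be pictured with a toy Python model. This is only a conceptual sketch using in-memory dictionaries; it says nothing about how Indri actually stores its lookup tables, and the document ids and docno values are invented for illustration.

```python
from collections import defaultdict

def build_lookups(documents, field):
    """Build toy forward and backward tables for one metadata field.

    documents maps an internal document id to a dict of metadata values.
    """
    forward = {}                 # document id -> field value
    backward = defaultdict(list)  # field value -> document ids
    for doc_id, metadata in documents.items():
        if field in metadata:
            value = metadata[field]
            forward[doc_id] = value
            backward[value].append(doc_id)
    return forward, dict(backward)

# Invented example data.
documents = {1: {"docno": "WSJ-001"}, 2: {"docno": "WSJ-002"}}
forward, backward = build_lookups(documents, "docno")
print(forward[1])            # WSJ-001
print(backward["WSJ-002"])   # [2]
```

A forward table answers "what is the value of this field for document 1", while a backward table answers "which documents have this value".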

field

a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:

name: the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric: the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
parserName: the name of the parser to use to convert a numeric field to an unsigned integer value. The default is NumericFieldAnnotator. If numeric field data is provided via offset annotations, you should use the value OffsetAnnotationAnnotator. If the field contains a formatted date (see Date Fields) you should use the value DateFieldAnnotator.
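A minimal Python sketch of generating field elements follows. The helper name add_field and the field names title and date are hypothetical. Writing parserName only for numeric fields is a choice made in this sketch, based on the description of parserName above.

```python
import xml.etree.ElementTree as ET

def add_field(parameters, name, numeric=False, parser_name=None):
    """Append one <field> element to the <parameters> tree."""
    field = ET.SubElement(parameters, "field")
    ET.SubElement(field, "name").text = name
    if numeric:
        ET.SubElement(field, "numeric").text = "true"
        # parserName is only meaningful for numeric fields.
        if parser_name is not None:
            ET.SubElement(field, "parserName").text = parser_name
    return field

parameters = ET.Element("parameters")
add_field(parameters, "title")
add_field(parameters, "date", numeric=True, parser_name="DateFieldAnnotator")
field_xml = ET.tostring(parameters, encoding="unicode")
print(field_xml)
```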

stemmer

a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter; the default is no stemming.

normalize

true to perform case normalization when indexing, false to index with mixed case. The default is true.

stopper

a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter; the default is no stopping.

offsetannotationhint

An optional parameter that gives the indexer a hint to speed up indexing of offset annotations when offset annotation files are specified in the <corpus> parameter. Valid values are "unordered" and "ordered". The "unordered" hint (the default) tells the indexer that the document IDs of the annotations are not necessarily in the same order as the documents in the corpus. The indexer then adjusts its internal memory allocations to pre-allocate enough memory before reading in the annotations file. If you are absolutely certain that the annotations in the offset annotation file are in exactly the same order as the documents, you can use the "ordered" hint. This tells the indexer not to read in the entire file at once, but to read the offset annotations file as needed, loading only the annotations for the document currently being indexed.

QueryEnvironment Parameters

Retrieval Parameters

index

path to an Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line. This element can be specified multiple times to combine Repositories.

server

hostname of a host running an Indri server (IndriDaemon). Specified as <server>hostname</server> in the parameter file and as -server=hostname on the command line. The hostname can include an optional port number to connect to, using the form hostname:portnum. This element can be specified multiple times to combine servers.
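The hostname:portnum form can be split with a few lines of Python. The function name parse_server and the host and port shown are hypothetical. The sketch returns None when no port is given rather than guessing a default, and it does not handle IPv6 literals.

```python
def parse_server(value):
    """Split 'hostname' or 'hostname:portnum' into (hostname, port or None)."""
    host, sep, port = value.partition(":")
    if not host:
        raise ValueError("missing hostname")
    # The port number is optional; None means it was not specified.
    return host, (int(port) if sep else None)

print(parse_server("example.org:15000"))  # ('example.org', 15000)
print(parse_server("localhost"))          # ('localhost', None)
```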

count

an integer value specifying the maximum number of results to return for a given query. Specified as <count>number</count> in the parameter file and as -count=number on the command line.

query

An indri query language query to run. This element can be specified multiple times.

rule

specifies the smoothing rule (TermScoreFunction) to apply. Format of the rule is:

( key ":" value ) [ "," key ":" value ]*

Here's an example rule in command line format:

-rule=method:linear,collectionLambda:0.2,field:title

and in parameter file format:
<rule>method:linear,collectionLambda:0.2,field:title</rule>

This corresponds to Jelinek-Mercer smoothing with background lambda equal to 0.2, only for items in a title field.

If nothing is listed for a key, all values are assumed. So, a rule that does not specify a field matches all fields. This makes -rule=method:linear,collectionLambda:0.2 a valid rule.
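The rule format above is simple enough to parse by hand. The following Python sketch splits a rule into its key and value pairs. The function name parse_rule is hypothetical and the code is an illustration of the grammar, not the parser used by Indri.

```python
def parse_rule(rule):
    """Parse 'key:value,key:value,...' into a dict of strings."""
    result = {}
    for pair in rule.split(","):
        key, sep, value = pair.partition(":")
        if not sep or not key or not value:
            raise ValueError("malformed key:value pair: %r" % pair)
        result[key] = value
    return result

rule = parse_rule("method:linear,collectionLambda:0.2,field:title")
print(rule)
# A rule without a field key applies to all fields.
print("field" in parse_rule("method:linear,collectionLambda:0.2"))  # False
```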

Valid keys:

method: smoothing method (text)
field: field to apply this rule to
operator: type of item in query to apply to { term, window }

Valid methods:

dirichlet: (also 'd', 'dir') (default mu=2500)
jelinek-mercer: (also 'jm', 'linear') (default collectionLambda=0.4, documentLambda=0.0). collectionLambda may also be written as just "lambda"; either will work.
twostage: (also 'two-stage', 'two') (default mu=2500, lambda=0.4)

If the rule doesn't parse correctly, the default is Dirichlet, mu=2500.
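The aliases, defaults, and fallback listed above can be combined into one lookup, sketched here in Python. The names ALIASES, DEFAULTS, and resolve_smoothing are hypothetical. Treating a missing or unknown method as a parse failure is an assumption of this sketch, and Indri's own handling may differ.

```python
ALIASES = {
    "dirichlet": "dirichlet", "d": "dirichlet", "dir": "dirichlet",
    "jelinek-mercer": "jelinek-mercer", "jm": "jelinek-mercer",
    "linear": "jelinek-mercer",
    "twostage": "twostage", "two-stage": "twostage", "two": "twostage",
}
DEFAULTS = {
    "dirichlet": {"mu": 2500.0},
    "jelinek-mercer": {"collectionLambda": 0.4, "documentLambda": 0.0},
    "twostage": {"mu": 2500.0, "lambda": 0.4},
}

def resolve_smoothing(rule):
    """Return (canonical method, parameters) for a smoothing rule string."""
    try:
        pairs = dict(item.split(":", 1) for item in rule.split(","))
        method = ALIASES[pairs["method"]]
        params = dict(DEFAULTS[method])
        # "lambda" is accepted as another name for collectionLambda.
        if method == "jelinek-mercer" and "lambda" in pairs:
            pairs.setdefault("collectionLambda", pairs["lambda"])
        for name in params:
            if name in pairs:
                params[name] = float(pairs[name])
    except (KeyError, ValueError):
        # Fallback described above: Dirichlet with mu=2500.
        return "dirichlet", {"mu": 2500.0}
    return method, params

print(resolve_smoothing("method:linear,collectionLambda:0.2,field:title"))
```

The field and operator keys are ignored by this sketch, since they select where the rule applies rather than how smoothing is computed.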

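stopper

a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line, in the same way as the stopper parameter for repository construction.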

maxWildcardTerms

(optional) An integer specifying the maximum number of wildcard terms that can be generated for a synonym list for this query or set of queries. If this limit is reached for a wildcard term, an exception will be thrown. If this parameter is not specified, a default of 100 will be used.

Baseline (non-LM) retrieval

baseline

Specifies the baseline (non-language modeling) retrieval method to apply. This enables running baseline experiments on collections too large for the Lemur RetMethod API. When running a baseline experiment, the queries may not contain any Indri query language operators; they must contain only terms.

Format of the parameter value:

(tfidf|okapi) [ "," key ":" value ]*

Here's an example in command line format:

-baseline=tfidf,k1:1.0,b:0.3

and in parameter file format:
<baseline>tfidf,k1:1.0,b:0.3</baseline>

Methods:

tfidf

Performs retrieval via tf.idf scoring as implemented in lemur::retrieval::TFIDFRetMethod using BM25TF term weighting. Pseudo-relevance feedback may be performed via the parameters below.

Parameters (optional):

k1: k1 parameter for term weight (default 1.2)
b: b parameter for term weight (default 0.75)

okapi

Performs retrieval via Okapi scoring as implemented in lemur::retrieval::OkapiRetMethod. Pseudo-relevance feedback may not be performed with this baseline method.

Parameters (optional):

k1: k1 parameter for term weight (default 1.2)
b: b parameter for term weight (default 0.75)
k3: k3 parameter for query term weight (default 7)
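A baseline value can be parsed and merged with the defaults listed above, as in this Python sketch. The names BASELINE_DEFAULTS and parse_baseline are hypothetical, and rejecting unknown keys is a choice made here for illustration rather than documented Indri behavior.

```python
BASELINE_DEFAULTS = {
    "tfidf": {"k1": 1.2, "b": 0.75},
    "okapi": {"k1": 1.2, "b": 0.75, "k3": 7.0},
}

def parse_baseline(value):
    """Parse '(tfidf|okapi)[,key:value]*' into (method, parameters)."""
    method, *pairs = value.split(",")
    if method not in BASELINE_DEFAULTS:
        raise ValueError("unknown baseline method: %r" % method)
    params = dict(BASELINE_DEFAULTS[method])
    for pair in pairs:
        key, sep, number = pair.partition(":")
        if not sep or key not in params:
            raise ValueError("unexpected parameter: %r" % pair)
        params[key] = float(number)
    return method, params

print(parse_baseline("tfidf,k1:1.0,b:0.3"))  # ('tfidf', {'k1': 1.0, 'b': 0.3})
```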

Formatting Parameters

queryOffset: an integer value specifying one less than the starting query number, eg 150 for TREC formatted output. Specified as <queryOffset>number</queryOffset> in the parameter file and as -queryOffset=number on the command line.
runID: a string specifying the id for a query run, used in TREC scorable output. Specified as <runID>someID</runID> in the parameter file and as -runID=someID on the command line.
trecFormat: the symbol true to produce TREC scorable output, otherwise the symbol false. Specified as <trecFormat>true</trecFormat> in the parameter file and as -trecFormat=true on the command line. Note that 0 can be used for false, and 1 can be used for true.
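To show how queryOffset and runID fit together, the Python sketch below formats ranked results as lines for TREC evaluation. It assumes the conventional six-column layout read by trec_eval (query number, the literal Q0, document id, rank, score, run id). That layout is not spelled out in this document, so the exact output of IndriRunQuery may differ. The function name trec_lines, the document ids, and the scores are invented for illustration.

```python
def trec_lines(query_index, ranked, query_offset, run_id):
    """Format results for one query.

    query_index is the zero-based position of the query in the run; the
    query number is taken to be query_offset + query_index + 1, following
    the description of queryOffset as one less than the starting number.
    ranked is a list of (docno, score) pairs, best first.
    """
    query_number = query_offset + query_index + 1
    return [
        f"{query_number} Q0 {docno} {rank} {score} {run_id}"
        for rank, (docno, score) in enumerate(ranked, start=1)
    ]

for line in trec_lines(0, [("DOC-A", -4.25), ("DOC-B", -5.5)], 150, "someID"):
    print(line)
```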

Pseudo-Relevance Feedback Parameters

fbDocs: an integer specifying the number of documents to use for feedback. Specified as <fbDocs>number</fbDocs> in the parameter file and as -fbDocs=number on the command line.
fbTerms: an integer specifying the number of terms to use for feedback. Specified as <fbTerms>number</fbTerms> in the parameter file and as -fbTerms=number on the command line.
fbMu: a floating point value specifying the value of mu to use for feedback. Specified as <fbMu>number</fbMu> in the parameter file and as -fbMu=number on the command line.
fbOrigWeight: a floating point value in the range [0.0..1.0] specifying the weight for the original query in the expanded query. Specified as <fbOrigWeight>number</fbOrigWeight> in the parameter file and as -fbOrigWeight=number on the command line.
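The feedback parameters can be assembled into command-line arguments with a small amount of validation, as in the Python sketch below. The function name feedback_args and the example values are hypothetical. The only check taken from this section is the [0.0..1.0] range of fbOrigWeight.

```python
def feedback_args(fb_docs, fb_terms, fb_mu, fb_orig_weight):
    """Build the pseudo-relevance feedback arguments for the command line."""
    if not 0.0 <= fb_orig_weight <= 1.0:
        raise ValueError("fbOrigWeight must lie in the range [0.0..1.0]")
    return [
        f"-fbDocs={fb_docs}",
        f"-fbTerms={fb_terms}",
        f"-fbMu={fb_mu}",
        f"-fbOrigWeight={fb_orig_weight}",
    ]

# Example values chosen only for illustration.
print(feedback_args(10, 20, 0.0, 0.5))
```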

IndriDaemon Parameters

memory: an integer value specifying the number of bytes to use for the query retrieval process. The value can include a scaling factor by adding a suffix. Valid values are (case insensitive) K = 1000, M = 1000000, G = 1000000000. So 100M would be equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
index: path to the Indri Repository to act as server for. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.
port: an integer value specifying the port number to use. Specified as <port>number</port> in the parameter file and as -port=number on the command line.

Generated on Tue Jun 15 11:02:58 2010 for Lemur by Doxygen 1.3.4