Indri Query Language Quick Reference



INTRODUCTION

The Indri query language, based on the Inquery query language, was designed to be robust. It can handle both simple keyword queries and extremely complex queries. Such a query language sets Indri apart from many other available search engines. It allows complex phrase matching, synonyms, weighted expressions, Boolean filtering, numeric (and dated) fields, and the extensive use of document structure (fields), among others.

Although Indri handles unstructured documents, many of the query language features make use of structured (tagged) documents. Consider the following document:

<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes ...
<h1>Agriculture</h1> ...
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
</body>
</html>

In Indri, a document is viewed as a sequence of text that may contain arbitrary tags. In the example above, the document consists of text marked up with HTML tags.

For each tag type T within a document (e.g. title, body, h1), we define the context of T to be all of the text and tags that appear within tags of type T. In the example above, all of the text and tags appearing between <body> and </body> tags define the body context. A single context is generated for each unique tag name. Therefore, a context defines a subdocument. Note that because of nested tags, certain word occurrences may appear in many contexts. Contexts may also be nested. For example, within the <body> context there is a nested <h1> context made up of all of the text and tags that appear within the body context and within <h1> and </h1> tags. Here are the title, h1, and body contexts:

title context:
<title>Department Descriptions</title>


h1 context:
<h1>Agriculture</h1>
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...


body context:
<body>
The following list describes ...
<h1>Agriculture</h1> ...
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
</body>


Finally, each context is made up of one or more extents. An extent is a sequence of text that appears within a single begin/end tag pair of the same type as the context. For the example above, in the <h1> context, there are extents "<h1>Agriculture</h1>", "<h1>Chemistry</h1>", etc. Both the title and body contexts contain only a single extent because there is only a single pair of <title> ... </title> and <body> ... </body> tags, respectively. The number of extents for a given tag type T is determined by the number of sequences of the form <T> text </T> that occur within the document.
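As a rough illustration of contexts and extents, the sketch below counts the extents of each tag type in the example document. It is not how Indri parses documents; the function name extents and the regular-expression approach are assumptions made for this example, and the pattern does not handle attributes or tags nested inside tags of the same name.

```python
import re

# The example document from the introduction.
DOC = """<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes ...
<h1>Agriculture</h1> ...
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
</body>
</html>"""


def extents(document, tag):
    """Return the text inside every <tag> ... </tag> pair, one entry per extent."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    return [match.group(1).strip() for match in pattern.finditer(document)]


# The context of a tag type is the collection of all of its extents.
for tag in ("title", "h1", "body"):
    print(tag, "context has", len(extents(DOC, tag)), "extent(s)")
```

The title and body contexts each yield one extent, while the h1 context yields four, matching the description above.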

The remainder of this document provides a broad overview of the language. For more specific details, see the Indri-related research papers and presentations.

QUERY LANGUAGE GRAMMAR

query   :=  ( beliefOp )+

beliefOp  :=  "#weight" ( extentRestrict )? weightedList
    | "#combine" ( extentRestrict )? unweightedList
    | "#or" ( extentRestrict )? unweightedList
    | "#not" ( extentRestrict )? '(' beliefOp ')'
    | "#wand" ( extentRestrict )? weightedList
    | "#wsum" ( extentRestrict )? weightedList
    | "#max" ( extentRestrict )? unweightedList
    | "#prior" '(' FIELD ')'
    | ( "#scoreifnot" | "#filrej" ) '(' unscoredTerm beliefOp ')'
    | ( "#scoreif" | "#filreq" ) '(' unscoredTerm beliefOp ')'
    | termOp ( '.' fieldList )? ( '.' '(' fieldList ')' )?

termOp    :=  ( "#od" POS_INTEGER | "#od" | '#' POS_INTEGER  ) '(' ( unscoredTerm )+ ')'
    | ( "#uw" POS_INTEGER | "#uw" ) '(' ( unscoredTerm )+ ')'
    | "#band" '(' ( unscoredTerm )+ ')'
    | "#datebefore" '(' date ')'
    | "#dateafter" '(' date ')'
    | "#datebetween" '(' date ' ' date ')'
    | "#dateequals" '(' date ')'
    | "<" ( unscoredTerm )+ ">"
    | "{" ( unscoredTerm )+ "}"
    | "#syn" '(' ( unscoredTerm )+ ')'
    | "#wsyn" '(' ( weight unscoredTerm )+ ')'
    | "#any" ':' TERM
    | "#any" '(' TERM ')'
    | "#less" '(' TERM integer ')'
    | "#greater" '(' TERM integer ')'
    | "#between" '(' TERM integer integer ')'
    | "#equals" '(' TERM integer ')'
    | "#base64" '(' ( "\t" | " " )* ( BASE64_CHAR )+ ( "\t" | " " )* ')'
    | "#base64quote" '(' ( '\t' | ' ' )* ( BASE64_CHAR )+ ( '\t' | ' ' )* ')'
    | '"' text '"'
    | "#wildcard" '(' TERM ')'
    | TEXT_TERM '*'
    | POS_INTEGER
    | POS_FLOAT
    | TERM

extentRestrict  :=  '[' "passage" POS_INTEGER ':' POS_INTEGER ']'
    | '[' FIELD ']'

weightedList  :=  '(' ( weight beliefOp )+ ')'

unweightedList  :=  '(' ( beliefOp )+ ')'

unscoredTerm  :=  termOp ( '.' fieldList )?

fieldList :=  FIELD ( ',' FIELD )*

date    :=  POS_INTEGER '/' TERM '/' POS_INTEGER
    | POS_INTEGER TERM POS_INTEGER
    | TERM

integer   :=  POS_INTEGER
    | NEG_INTEGER

weight    :=  POS_FLOAT
    | POS_INTEGER

TERM    :=  ( '0'..'9' )+ ('a'..'z' | 'A'..'Z' | '-' | '_')
    | TEXT_TERM

FIELD   :=  TEXT_TERM

TEXT_TERM :=  ( '\u0080'..'\u00ff' | ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_') )+

POS_INTEGER :=  ( '0'..'9' )+
NEG_INTEGER :=  '-' ( '0'..'9' )+
POS_FLOAT :=  ( '0'..'9' )+ '.' ( '0'..'9' )*
BASE64_CHAR :=  ('a'..'z' | 'A'..'Z' | '0'..'9' | '+' | '/')
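
The lexical rules at the end of the grammar can be tried out with ordinary regular expressions. The sketch below transcribes four of them; the names TOKEN_PATTERNS and classify are made up for this example, and the order of the checks is a choice made here (TEXT_TERM also matches digit-only strings), not something the grammar specifies.

```python
import re

# Patterns transcribed from the lexical rules above. They are tried in this
# order because TEXT_TERM would otherwise also claim strings such as "42".
TOKEN_PATTERNS = [
    ("POS_INTEGER", re.compile(r"[0-9]+")),
    ("NEG_INTEGER", re.compile(r"-[0-9]+")),
    ("POS_FLOAT", re.compile(r"[0-9]+\.[0-9]*")),
    ("TEXT_TERM", re.compile(r"[\u0080-\u00ffA-Za-z0-9_-]+")),
]


def classify(token):
    """Return the name of the first lexical rule that matches the whole token."""
    for name, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return None


for token in ("42", "-7", "0.5", "white-house", "#combine"):
    print(repr(token), "->", classify(token))
```

Operator names such as #combine are not terms, so classify returns None for them; a full parser would handle them at the beliefOp and termOp level.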

TERMS / PROXIMITY

Terms are the basic building blocks of Indri queries. Terms come in several forms, including single terms, ordered and unordered phrases, and synonyms. In addition, there are a number of options that allow you to specify whether a term should appear within a certain field, or whether it should be scored within a given context.

Terms:

Examples:

Proximity terms:

Examples:
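
The grammar defines ordered windows (#odN, also written #N) and unordered windows (#uwN). The sketch below simulates their matching on a token list under two assumptions that this reference does not spell out: an ordered window requires the terms in order with each term at most N positions after the previous one, and an unordered window requires all terms inside some span of N consecutive tokens, in any order. The function names are illustrative, and repeated terms are not handled.

```python
def ordered_window(tokens, terms, n):
    """Assumed #odN semantics: terms in order, adjacent matches at most n apart."""
    def extend(prev_pos, idx):
        if idx == len(terms):
            return True
        last = min(prev_pos + n, len(tokens) - 1)
        for pos in range(prev_pos + 1, last + 1):
            if tokens[pos] == terms[idx] and extend(pos, idx + 1):
                return True
        return False

    return any(tokens[p] == terms[0] and extend(p, 1) for p in range(len(tokens)))


def unordered_window(tokens, terms, n):
    """Assumed #uwN semantics: every term occurs within some span of n tokens."""
    wanted = set(terms)
    return any(wanted <= set(tokens[i:i + n]) for i in range(len(tokens)))


tokens = "the white house is big".split()
print(ordered_window(tokens, ["white", "house"], 1))    # like #1( white house )
print(ordered_window(tokens, ["house", "white"], 1))    # wrong order
print(unordered_window(tokens, ["house", "white"], 2))  # like #uw2( house white )
```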

Synonyms:

The first three expressions are equivalent. They each treat all of the expressions listed as synonyms. The #wsyn operator treats the terms as synonyms, but allows weights to be assigned to each term.

Examples:

NOTE: The arguments given to this operator can only be term/proximity expressions.
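
A small sketch of how the synonym forms allowed by the grammar might be generated programmatically. The helper names syn_forms and wsyn, and the terms dog and canine, are illustrative only; the output strings follow the #syn, { }, < > and #wsyn productions in the grammar section.

```python
def syn_forms(*terms):
    """Render the three unweighted synonym forms allowed by the grammar."""
    body = " ".join(terms)
    return ["#syn( " + body + " )", "{ " + body + " }", "< " + body + " >"]


def wsyn(weighted_terms):
    """Render a weighted synonym list from (weight, term) pairs."""
    body = " ".join("{} {}".format(w, t) for w, t in weighted_terms)
    return "#wsyn( " + body + " )"


for form in syn_forms("dog", "canine"):
    print(form)
print(wsyn([(1.0, "dog"), (0.5, "canine")]))
```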

Wildcard Operations:

As of version 4.4, the Indri Query Language supports wildcard terms in the form of suffix-only wildcard operations. To specify a wildcard, use the #wildcard operator or place an asterisk (*) at the end of a term. Note that only suffix-based wildcards are available at this time; that is, a wildcard term must have at least one character at the beginning and the wildcard operator (*) must occur at the end.

Since the wildcard operator will create a synonym list of available terms, it is necessary for performance reasons to limit the number of terms generated for any given wildcard operator. The default maximum number of synonyms generated is 100 for every wildcard term. This can be overridden in the query parameters by the use of the <maxWildcardTerms> parameter. If the limit is breached, an exception will be thrown.

Examples:
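
The two wildcard forms in the grammar are #wildcard( dog ) and dog*. To illustrate the expansion limit described above, the sketch below simulates prefix expansion against a small vocabulary. It is only a model of the described behavior, not Indri's implementation; expand_wildcard, TooManyWildcardTerms and the sample vocabulary are hypothetical names and data.

```python
class TooManyWildcardTerms(Exception):
    """Raised when a wildcard expands to more terms than the limit allows."""


def expand_wildcard(prefix, vocabulary, max_terms=100):
    """Return the vocabulary terms that start with prefix (suffix-only wildcard)."""
    if not prefix:
        raise ValueError("a wildcard term needs at least one leading character")
    matches = sorted(term for term in vocabulary if term.startswith(prefix))
    if len(matches) > max_terms:
        raise TooManyWildcardTerms(
            "{}* matches {} terms, limit is {}".format(prefix, len(matches), max_terms)
        )
    return matches


vocabulary = ["dog", "dogs", "dogma", "cat", "door"]
matches = expand_wildcard("dog", vocabulary)
# The matched terms are then treated as a synonym list.
print("#syn( " + " ".join(matches) + " )")
```

Passing a smaller max_terms plays the role of lowering <maxWildcardTerms> and makes the same call raise instead.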

"Any" operator:

Examples:

Field restriction / evaluation:

Examples:
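
According to the grammar, a term may be followed by a field list (dog.title) and by a parenthesized field list (dog.(title)). The sketch below renders both forms. The reading given in the comments, restriction to a field versus evaluation within a context, is taken from the introduction to this section; the helper name term and its parameters are illustrative.

```python
def term(text, fields=(), contexts=()):
    """Render a term with optional field restriction and evaluation context."""
    out = text
    if fields:
        # term.field: the term should appear within the given field(s)
        out += "." + ",".join(fields)
    if contexts:
        # term.(field): the term is scored within the given context(s)
        out += ".(" + ",".join(contexts) + ")"
    return out


print(term("dog", fields=["title"]))
print(term("dog", contexts=["title"]))
print(term("dog", fields=["title"], contexts=["header"]))
```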

COMBINING BELIEFS

Belief operators allow you to combine beliefs (scores) about terms, phrases, etc. There are both unweighted and weighted belief operators. With the weighted operators, you can assign varying weights to certain expressions. This allows you to control how much of an impact each expression within your query has on the final score.

Belief operators:

Examples:

NOTE: If you are unsure which belief operator to use, it is always "safest" to default to the #combine or #weight operator. These operators are often the best choice for combining evidence. NEVER use #wsum or #wand unless you really know what you're doing!
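
To make the difference between these operators concrete, here is a minimal numeric sketch. The formulas are assumptions based on how the operators are commonly described for Indri's inference network (normalized geometric mean for #combine, weighted geometric mean for #weight, weighted arithmetic mean for #wsum, noisy-or for #or); they are not stated in this reference, and the function names are made up for the example.

```python
import math


def combine(beliefs):
    """Assumed #combine: geometric mean of the beliefs."""
    return math.exp(sum(math.log(b) for b in beliefs) / len(beliefs))


def weight(pairs):
    """Assumed #weight: weighted geometric mean over (weight, belief) pairs."""
    total = sum(w for w, _ in pairs)
    return math.exp(sum(w * math.log(b) for w, b in pairs) / total)


def wsum(pairs):
    """Assumed #wsum: weighted arithmetic mean over (weight, belief) pairs."""
    total = sum(w for w, _ in pairs)
    return sum(w * b for w, b in pairs) / total


def belief_or(beliefs):
    """Assumed #or: one minus the product of the complements."""
    return 1.0 - math.prod(1.0 - b for b in beliefs)


print(combine([0.25, 1.0]))
print(weight([(1, 0.25), (3, 1.0)]))
print(wsum([(1, 0.2), (3, 0.6)]))
print(belief_or([0.5, 0.5]))
```

Under these assumptions, #combine is simply #weight with equal weights, which is one reason the two are interchangeable defaults.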

Extent / Passage retrieval:

Example:

The accompanying slides discuss more of the details and give further examples.

Accessing children, parent and ancestor extents / passages:

Beginning with the Lemur Toolkit version 4.3.2 and Indri version 2.3.2, it is possible to reference parent and ancestor extents.

Example:

Note: if child, parent, or ancestor queries are slow, you may want to be certain to index the specified fields explicitly as an ordinal. This speeds things up at the cost of a minimal amount of disk space. In the example above ("title"), the following would be placed in the build index parameters:
  <field>
    <name>title</name>
    <ordinal>true</ordinal>
  </field>

FILTER OPERATORS

Filter operators allow you to score only a subset of an entire collection by restricting which documents actually get scored.

Filter operators:

Examples:

NOTE: The first argument must always be a term/proximity expression.
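
The effect of the two filter operators can be sketched as a selection step applied before ranking. The code below is a toy model, not Indri's implementation: documents are plain token lists, the scores are made-up numbers, and the filter is a single term rather than an arbitrary term/proximity expression. In query form these would correspond to something like #filreq( sheep #combine( dolly cloning ) ) and #filrej( sheep #combine( dolly cloning ) ).

```python
def filreq(required_term, scores, documents):
    """Keep scores only for documents that contain required_term."""
    return {d: s for d, s in scores.items() if required_term in documents[d]}


def filrej(rejected_term, scores, documents):
    """Keep scores only for documents that do not contain rejected_term."""
    return {d: s for d, s in scores.items() if rejected_term not in documents[d]}


# Hypothetical collection and belief scores for #combine( dolly cloning ).
documents = {
    "d1": "dolly the sheep and cloning".split(),
    "d2": "cloning experiments in mice".split(),
    "d3": "dolly parton concert".split(),
}
scores = {"d1": -3.1, "d2": -4.0, "d3": -4.2}

print(filreq("sheep", scores, documents))
print(filrej("sheep", scores, documents))
```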

NUMERIC / DATE FIELD OPERATORS

Numeric and date field operators provide a number of facilities for matching different criteria. These operators are very useful when used in combination with the filter operators.

General numeric operators:

Date operators:

Acceptable date formats:

Examples:

NOTE: The general numeric operators only work on indexed numeric fields, whereas the date operators are only applicable to a specially indexed numeric field named "date". See the indexing documentation for more on numeric fields.
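
As a sketch of how numeric operators combine with filters, the helpers below render the numeric productions from the grammar and wrap one in #filreq. The field name year is hypothetical and would have to be an indexed numeric field; the helper names are illustrative.

```python
def numeric(op, field, *values):
    """Render #less, #greater, #equals or #between for an indexed numeric field."""
    arity = {"less": 1, "greater": 1, "equals": 1, "between": 2}
    if op not in arity or len(values) != arity[op]:
        raise ValueError("unsupported operator or wrong number of values")
    rendered = " ".join(str(int(v)) for v in values)
    return "#{}( {} {} )".format(op, field, rendered)


def filreq(filter_expr, belief_expr):
    """Score belief_expr only for documents that match filter_expr."""
    return "#filreq( {} {} )".format(filter_expr, belief_expr)


print(numeric("between", "year", 1990, 2000))
print(filreq(numeric("less", "year", 2000), "#combine( dolly cloning )"))
```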

DOCUMENT PRIORS

Document priors allow you to impose a "prior probability" over the documents in a collection.

Prior

Example:

APPLICATIONS

Here we list suggested uses of the language for several common information retrieval tasks.

Ad Hoc Retrieval (Query Likelihood)

Ad hoc retrieval is the standard information retrieval task of finding documents that are topically relevant to a given information need (query). One common probabilistic approach to ad hoc retrieval is the query likelihood retrieval paradigm from language modeling. It is very simple to construct an Indri query that ranks documents the same as query likelihood. For the query, "literacy rates africa", we construct the following Indri query:

#combine( literacy rates africa )

This returns a ranked list that is exactly equivalent to the query likelihood ranking (under the given smoothing conditions).
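
The rank equivalence can be checked numerically. The sketch below assumes Dirichlet-smoothed language models (the value mu=2500 is an assumption, as is treating #combine as the average of the per-term log probabilities) and compares the ordering produced by the summed query log-likelihood with the ordering produced by the averaged score. The toy collection and all function names are made up for the example.

```python
import math


def keyword_to_combine(query):
    """Turn a plain keyword query into an Indri #combine query."""
    return "#combine( " + " ".join(query.split()) + " )"


def log_prob(term, doc, collection, mu=2500.0):
    """Dirichlet-smoothed log P(term | doc); the term must occur in the collection."""
    p_collection = collection.count(term) / len(collection)
    return math.log((doc.count(term) + mu * p_collection) / (len(doc) + mu))


def query_likelihood(terms, doc, collection):
    return sum(log_prob(t, doc, collection) for t in terms)


def combine_score(terms, doc, collection):
    # Assumed #combine behavior: average of the per-term log beliefs.
    return query_likelihood(terms, doc, collection) / len(terms)


def rank(docs, score):
    return sorted(docs, key=score, reverse=True)


DOCS = {
    "d1": "literacy rates in africa are rising".split(),
    "d2": "rates of rainfall in africa".split(),
    "d3": "literacy programs".split(),
}
collection = [t for d in DOCS.values() for t in d]
terms = "literacy rates africa".split()

print(keyword_to_combine("literacy rates africa"))
print(rank(DOCS, lambda d: query_likelihood(terms, DOCS[d], collection)))
print(rank(DOCS, lambda d: combine_score(terms, DOCS[d], collection)))
```

Because the averaged score is the summed score divided by a constant, the two rankings are always identical.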

Pseudo-Relevance Feedback / Query Expansion

Both pseudo-relevance feedback and query expansion methods typically begin with some initial query, do some processing, and then return a list of expansion terms. The original query is then augmented with the expansion terms and rerun.

Indri's pseudo-relevance feedback mechanism is an adaptation of Lavrenko's relevance models.

The following is a basic summary of the process:
  1. Retrieve documents using original query, which results in a ranked list ordered by P( I | D )
  2. Compute relevance model, P(r | I), over representation concepts (features) using top fbDocs documents from original ranked list
  3. Sort representation concepts by P(r | I) and keep top fbTerms
  4. Construct query Q_RM as: #weight( P(r_1 | I) r_1 ... P(r_fbTerms | I ) r_fbTerms )
  5. Construct expanded query as: #weight( fbOrigWeight Q 1-fbOrigWeight Q_RM )
  6. Retrieve documents based on expanded query
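
Steps 3 to 5 above amount to string construction once the relevance model probabilities are known. The sketch below assumes the probabilities P(r | I) have already been estimated (the values shown are invented) and uses fixed-point formatting so the weights stay within the POS_FLOAT form of the grammar. The function names are illustrative; fbTerms and fbOrigWeight correspond to the parameters named in the steps.

```python
def relevance_model_query(term_probs, fb_terms):
    """Steps 3-4: keep the fb_terms most probable concepts and build Q_RM."""
    top = sorted(term_probs.items(), key=lambda kv: kv[1], reverse=True)[:fb_terms]
    body = " ".join("{:.4f} {}".format(p, t) for t, p in top)
    return "#weight( " + body + " )"


def expanded_query(original_query, rm_query, fb_orig_weight):
    """Step 5: interpolate the original query with the relevance model query."""
    return "#weight( {:.4f} {} {:.4f} {} )".format(
        fb_orig_weight, original_query, 1.0 - fb_orig_weight, rm_query
    )


# Hypothetical relevance model estimated from the top fbDocs documents.
term_probs = {"literacy": 0.4, "education": 0.3, "school": 0.2, "the": 0.1}

q_rm = relevance_model_query(term_probs, fb_terms=2)
print(expanded_query("#combine( literacy rates africa )", q_rm, fb_orig_weight=0.5))
```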

Named Page Finding / Homepage Finding

Named page finding and homepage finding are examples of known-item search. That is, the user knows some page exists, and is attempting to find it. One popular approach to known-item search is to use a mixture of context language models. This can easily be expressed in the Indri query language. For example, for the query "bbc news", the following query would be constructed:

#combine( #wsum( 5.0 bbc.(title) 3.0 bbc.(anchor) 1.0 bbc )
    #wsum( 5.0 news.(title) 3.0 news.(anchor) 1.0 news ) )

For each term in the query, the #wsum operator constructs a mixture model from the title, anchor, and whole document context language models and weights each model appropriately. The scores for the two terms are then #combined together.
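
Queries of this shape are easy to generate from a keyword query. The following sketch reproduces the "bbc news" query above; the function names and the default weights (taken from the example) are illustrative, and suitable weights would normally be tuned for a given collection.

```python
def context_mixture(term, context_weights, document_weight):
    """Build the #wsum mixture of context language models for one term."""
    parts = ["{} {}.({})".format(w, term, c) for c, w in context_weights]
    parts.append("{} {}".format(document_weight, term))
    return "#wsum( " + " ".join(parts) + " )"


def known_item_query(query,
                     context_weights=(("title", 5.0), ("anchor", 3.0)),
                     document_weight=1.0):
    """Combine one per-term mixture for every keyword in the query."""
    mixtures = [
        context_mixture(term, context_weights, document_weight)
        for term in query.split()
    ]
    return "#combine( " + " ".join(mixtures) + " )"


print(known_item_query("bbc news"))
```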