The Lemur Toolkit

Features

Below is a summary of the features of the Lemur Toolkit:

Sophisticated structured query languages (using InQuery and Indri)
Support for XML and structured document retrieval
Commonly used with a wide range of research test collections (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
"Out-of-the-box" site search capability for indexing your own web pages
Interactive interfaces for Windows, Linux, and Web
Distributed information retrieval and document clustering applications
Cross-platform, fast and modular code written in C++
C++, Java and C# APIs
Free and open-source software
In use for over 6 years by a large and growing user community

Indexing

Multiple indexing methods for small, medium and large-scale (terabyte) collections
Built-in support for English, Chinese and Arabic text
Porter and Krovetz word stemming
Incremental indexing
Out-of-the-box indexing support for TREC Text, TREC Web, plain text, HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
Indexes inline and offset text annotations (e.g., part-of-speech and named entities)
Indexes document attributes

Retrieval

Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
Relevance feedback and pseudo-relevance feedback
Wildcard term expansion (using Indri)
Passage and XML element retrieval
Cross-lingual retrieval
Smoothing via Dirichlet priors and Markov chains
Supports arbitrary document priors (e.g., PageRank, URL depth)
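
To make the smoothing and document-prior features above more concrete, here is a minimal sketch of query-likelihood scoring with Dirichlet-prior smoothing and an optional document log-prior. It is written in plain Python and does not use the Lemur or Indri APIs. The function name dirichlet_score, the toy documents, and the value of mu are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_tf, collection_len,
                    mu=2500.0, log_prior=0.0):
    """Log query likelihood of `doc` under Dirichlet-prior smoothing.

    Hypothetical helper for illustration; not part of the Lemur Toolkit.
    `log_prior` is the log of a document prior (for example, one derived
    from PageRank or URL depth) and is simply added to the score.
    """
    tf = Counter(doc)
    doc_len = len(doc)
    score = log_prior
    for term in query:
        # Collection language model: P(term | collection)
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # this sketch skips terms unseen in the collection
        # Smoothed estimate: (tf + mu * P(term | C)) / (|d| + mu)
        score += math.log((tf[term] + mu * p_coll) / (doc_len + mu))
    return score


# Toy collection (assumed data, whitespace-tokenized)
docs = {
    "d1": "lemur toolkit supports language modeling retrieval".split(),
    "d2": "indri query language supports structured retrieval".split(),
}
collection_tf = Counter(t for d in docs.values() for t in d)
collection_len = sum(collection_tf.values())

query = "language modeling".split()
ranking = sorted(
    docs,
    key=lambda name: dirichlet_score(
        query, docs[name], collection_tf, collection_len, mu=10.0),
    reverse=True,
)
print(ranking)
```

With a uniform prior, the document that contains both query terms ranks first. Passing a larger log_prior for a document raises its score independently of the query, which is how a prior such as PageRank can influence the ranking in this sketch.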