The Lemur Project has the following components and sub-projects. Click
on the name to find out more about each one.
Indri is a search engine that provides state-of-the-art text search
and a rich structured query language for text collections of up to
50 million documents (single machine) or 500 million documents
(distributed search). Available for Linux and Windows.
Galago is a Java toolkit for experimenting with text search. It is
based on small, pluggable components that are easy to replace and
change, both during indexing and during retrieval.
The Lemur Toolkit is designed to facilitate research in language
modeling and information retrieval (IR), where IR is broadly
interpreted to include such technologies as ad hoc and distributed
retrieval with structured queries, cross-language IR, summarization,
filtering, and categorization. The system's underlying architecture
was built to support the technologies above. We provide many useful
sample applications, but have designed the toolkit to allow you to
easily program your own customizations and applications. The final
release of the Lemur Toolkit is version 4.12, released 06/21/2010.
A web browser plugin that captures user search and browsing behavior
to support research on information seeking behavior, learning to rank,
and related topics. Available for Firefox and Internet Explorer.
RankLib is a library of learning-to-rank algorithms. Full details are
in the RankLib Documentation on the Lemur Project wiki.
A dataset of 1 billion high PageRank web pages in ten
languages, collected in January and February 2009. The dataset was
used by several tracks of the TREC conference in 2009 and 2010.
A dataset currently being created, with the goal of 1 billion
English pages collected from February through April 2012.