Overview of the Lemur Toolkit

Contents

  1. What is Lemur?
  2. What kinds of things can Lemur do?
  3. How can it be useful?
  4. What have people used Lemur for?
  5. How can I use Lemur?
  6. What does Lemur come with?
  7. What was Lemur written in, and what platforms does it work on?

1. What is Lemur?

Lemur is a toolkit designed to facilitate research in language modeling and information retrieval (IR), where IR is interpreted broadly to include technologies such as ad hoc and distributed retrieval, retrieval with structured queries, cross-language IR, summarization, filtering, and categorization.  The system's underlying architecture was built to support these technologies.  We provide many useful sample applications, but we have designed the toolkit so that you can easily program your own customizations and applications.

2. What kinds of things can Lemur do?

The Lemur toolkit supports the construction of basic text retrieval systems using language modeling methods, as well as traditional methods such as those based on the vector space model and Okapi. As the toolkit evolves, it is expected that it will support research in a broader range of information technologies such as filtering, and even question answering.

3. How can it be useful?

Lemur is particularly useful for researchers in language modeling and information retrieval who do not want to write their own indexers but would rather focus on developing new techniques and algorithms. In addition to indexing, we also provide some baseline retrieval algorithms, such as Okapi and KL divergence, both for direct use and for comparison against new methods.
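
To give a sense of what a baseline such as Okapi computes, here is a minimal sketch of a common textbook form of the Okapi BM25 term weight. It is an illustration only and is not taken from the Lemur source; the function name bm25TermWeight, its parameter names, and the default values of k1 and b are assumptions made for this example, and Lemur's own Okapi implementation may differ in its details.

```cpp
#include <cassert>
#include <cmath>

// Textbook Okapi BM25 weight for one query term in one document.
// Hypothetical helper for illustration; NOT part of Lemur's API.
//   tf           - occurrences of the term in the document
//   docLength    - length of the document in terms
//   avgDocLength - average document length in the collection
//   numDocs      - number of documents in the collection
//   docFreq      - number of documents containing the term
//   k1, b        - usual BM25 tuning parameters (assumed defaults)
double bm25TermWeight(double tf, double docLength, double avgDocLength,
                      double numDocs, double docFreq,
                      double k1 = 1.2, double b = 0.75) {
    assert(avgDocLength > 0.0 && docFreq > 0.0 && docFreq <= numDocs);
    // Robertson/Sparck Jones inverse document frequency.
    const double idf = std::log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    // Document length normalization applied to the saturation constant.
    const double norm = k1 * (1.0 - b + b * docLength / avgDocLength);
    // Term frequency saturates as tf grows.
    return idf * (tf * (k1 + 1.0)) / (tf + norm);
}
```

A document's score for a query would be the sum of this weight over the query terms. A new retrieval method can then be judged by comparing its rankings against those produced by a baseline of this kind.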

You can use Lemur to build your own search systems. We have implemented and included basic ad hoc IR, distributed IR, IR using structured queries, IR using distributed indexes, document clustering, and summarization. Others have used Lemur for filtering tasks, webpage finding, passage finding, and web search engines.

4. What have people used Lemur for?

The toolkit has been used to carry out experiments on several different aspects of language modeling for ad hoc retrieval. For example, it has been used on standard TREC collections to compare smoothing strategies for document models and to compare query expansion methods for estimating query models. For an example of its use, see the SIGIR 2001 paper "A study of smoothing methods for language models applied to ad hoc information retrieval."
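
As a rough illustration of what "smoothing strategies for document models" means, the sketch below shows two widely used estimators, Jelinek-Mercer interpolation and Dirichlet prior smoothing, in their standard textbook form. This is not code from Lemur; the function and parameter names are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers for illustration; NOT part of Lemur's API.
// In both functions:
//   termCount      - c(w,d), occurrences of word w in document d
//   docLength      - |d|, total number of words in d
//   collectionProb - p(w|C), probability of w in the whole collection

// Jelinek-Mercer smoothing: linear interpolation between the document's
// maximum-likelihood estimate and the collection model.
//   p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|C)
double jelinekMercer(double termCount, double docLength,
                     double collectionProb, double lambda) {
    assert(docLength > 0.0 && lambda >= 0.0 && lambda <= 1.0);
    return (1.0 - lambda) * (termCount / docLength) + lambda * collectionProb;
}

// Dirichlet prior smoothing: the collection model acts as mu pseudo-counts.
//   p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)
double dirichletPrior(double termCount, double docLength,
                      double collectionProb, double mu) {
    assert(docLength + mu > 0.0);
    return (termCount + mu * collectionProb) / (docLength + mu);
}
```

Both estimators give a nonzero probability to a word that does not occur in the document, which is the point of smoothing. They differ in how the amount of smoothing depends on document length: Jelinek-Mercer applies the same lambda to every document, while the Dirichlet prior smooths long documents less than short ones.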

The toolkit has also been used for tasks at TREC, including filtering and web page-finding. It has been used in classrooms for instruction about information retrieval and web search engines. It also supports research projects in various other aspects of IR, such as question answering and distributed networks.

5. How can I use Lemur?

Lemur has many applications for indexing and retrieval that are fully functional for many purposes, so you can use them "out of the box". In addition, since Lemur was written to facilitate research on LM and IR, the design allows you to try out new retrieval methods by subclassing abstract interfaces, or write new applications based on existing methods.
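
The general pattern of trying out a new retrieval method by subclassing an abstract interface can be sketched as follows. Every name here (Doc, RetrievalMethod, TermCountMethod, rankDocuments) is hypothetical and chosen for this example; these are not Lemur's actual classes, so consult the toolkit's API documentation for the real interfaces.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are NOT Lemur's classes.
struct Doc {
    std::string id;
    std::vector<std::string> terms;
};

// Abstract interface: a retrieval method scores one document for a query.
class RetrievalMethod {
public:
    virtual ~RetrievalMethod() = default;
    virtual double score(const std::vector<std::string>& query,
                         const Doc& doc) const = 0;
};

// A new method is tried out by subclassing the interface. This toy method
// simply counts how often the query terms occur in the document.
class TermCountMethod : public RetrievalMethod {
public:
    double score(const std::vector<std::string>& query,
                 const Doc& doc) const override {
        double total = 0.0;
        for (const std::string& q : query) {
            total += static_cast<double>(
                std::count(doc.terms.begin(), doc.terms.end(), q));
        }
        return total;
    }
};

// Ranking code depends only on the interface, so it works unchanged
// with any subclass. Results are sorted by descending score.
std::vector<std::pair<std::string, double>> rankDocuments(
        const RetrievalMethod& method,
        const std::vector<std::string>& query,
        const std::vector<Doc>& docs) {
    std::vector<std::pair<std::string, double>> results;
    for (const Doc& d : docs) {
        results.emplace_back(d.id, method.score(query, d));
    }
    std::stable_sort(results.begin(), results.end(),
        [](const std::pair<std::string, double>& a,
           const std::pair<std::string, double>& b) {
            return a.second > b.second;
        });
    return results;
}
```

The benefit of this design is that indexing, ranking, and evaluation code can stay untouched: only the subclass that implements the new scoring function has to be written.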

The source code is provided to encourage users to modify the toolkit in support of their own research, development, or teaching activities. All are welcome and encouraged to submit their modifications to the Lemur project developers, so that they can be considered for inclusion in subsequent versions of the toolkit.

6. What does Lemur come with?

Lemur comes with all the source code and makefiles necessary to build the libraries for indexing and retrieval (under a CMU and UMass licensing agreement). For Windows, you can also download pre-compiled libraries and executables.

Lemur currently supports the following features:

The Lemur Toolkit includes CGI code that uses Lemur indexes and a stand-alone GUI that performs retrieval using the methods included with Lemur. For a full list of the applications that come with Lemur, see the Lemur Applications page.

The Lemur toolkit download also includes a small sample data file with test scripts that use our applications. Expected results from these scripts are available on our website.

7. What was Lemur written in, and what platforms does it work on?

Lemur was written primarily in C++. (The GUI is written with Java/Swing.)

It is compatible with UNIX (Linux and Solaris) and Windows XP. Although we do not currently support them officially, people also run it on Cygwin, Windows 2000, and Windows NT.