HOWTO: Processing ClueWeb09 with Indri
Version 4.10 of the Lemur Toolkit, released 06/22/2009, contains all of the updates described below for using version 4.9 with the ClueWeb09 collection. These instructions will remain available for a time for those who have not yet updated to version 4.10.
Our goal with the 04/09/2009 release of the Lemur Toolkit version 4.9 was to provide a version capable of processing the ClueWeb09 collection in its packaged format. Although the 4.9 release goes most of the way, some deficiencies prevent it from fully doing so. The next release, scheduled for June 22, 2009, will address these issues. Until then, users who want to get started with the ClueWeb09 collection will need to take some additional steps in order to follow the instructions laid out in this document. This document is expected to evolve over time. There is an active thread on this topic on the SourceForge-hosted forums.
Processing the ClueWeb09 collection
Starting from the Lemur Toolkit version 4.9 distribution and updating certain source files from the CVS repository hosted on SourceForge, it is possible to build 24 Indri Repositories, one for each language segment. Indexing takes between 50 and 100 CPU hours per segment. The process needs 7 terabytes of available disk space, and the final Repositories consume roughly 3.5 terabytes.
Updating the Lemur Toolkit source code
The 4.9 distribution needs several fixes and improvements before it can index the ClueWeb09 collection. The easiest way to pick them up is to check the current HEAD version out of CVS, then configure, compile, and install. The changes address the following:
- The warc FileClassEnvironment loses some records due to improper processing of the content length.
- The caching of document lengths (an optimization added to 4.9) is not thread safe when querying.
- The CompressedCollection storage of ParsedDocuments consumes a great deal of disk space, more than doubling the amount of disk needed. This data is not needed when performing TREC-style evaluation runs (generation of ranked lists only), so the storage has been made optional.
- The HTMLParser fails on hrefs that use single quotes rather than double quotes, e.g. <a href='some_url'>, causing a crash.
- The warc FileClassEnvironment has been updated to read gzip (.gz) files directly, removing the need to decompress and store the input files.
Resource Requirements
The results reported here were generated using a cluster of 24 machines, configured as follows:
- 2 x 3.2 GHz Xeon CPUs
- OS: Linux compute-0-10.local 2.6.9-55.0.2.EL_lustre.1.6.2smp #1 SMP Mon Aug 20 17:39:48 MDT 2007 x86_64 x86_64 x86_64 GNU/Linux
- 4 GB physical memory
- 1 TB WD Caviar local disk
- ClueWeb09 disks mounted via NFS, each in an external USB enclosure, served by 4 separate NFS servers.
Indexing and Parameters
A new parameter, storeDocs, has been added to IndriBuildIndex. When this parameter is set to false, the ParsedDocument data structure is not stored in the CompressedCollection. The warc FileClassEnvironment will automatically recognize the input of gzipped files, and process them accordingly.
To obtain the mapping between document ids and urls, specify the field url as both a forward and a backward metadata field, e.g.:
<metadata>
  <forward>url</forward>
  <backward>url</backward>
</metadata>
The url metadata field is inserted by the HTMLParser automagically when processing the documents.
TREC-B (English_1)
If you only want to index the TREC-B collection, found in English_1, follow the English_1 - English_10 section below, limiting yourself to the single segment.
English_1 - English_10
The English segments were processed with the default stoplist, data/stoplist.dft, formatted as an Indri parameter file of the form:
<parameters>
  <stopper>
    <word>a</word>
    <word>an</word>
    ...
  </stopper>
</parameters>
The data was stemmed with the Krovetz stemmer. The fields title and heading (H1, H2, H3, H4, H5) were indexed as data. The field docno was indexed as forward and backward metadata (performed automagically by IndriBuildIndex). No other metadata was indexed. The parameter file for English_1 (TREC-B) is below:
<parameters>
  <memory>2560M</memory>
  <storeDocs>false</storeDocs>
  <index>ClueWeb09_English_1</index>
  <corpus>
    <path>/mnt/nfs/clue1/ClueWeb09_English_1</path>
    <class>warc</class>
  </corpus>
  <field><name>title</name></field>
  <field><name>heading</name></field>
  <stemmer><name>krovetz</name></stemmer>
</parameters>
Indexing was performed on each machine with:
$ IndriBuildIndex index.param stopwords.param
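When indexing several segments, it can be convenient to generate the two parameter files rather than edit them by hand. The following is a minimal Python sketch, not part of the Lemur Toolkit. The function names build_index_params and build_stopper_params are hypothetical, and it assumes the stoplist is available as a plain list of words. Only the elements shown in the parameter files above are emitted; the url metadata block is optional.

```python
import xml.etree.ElementTree as ET


def build_index_params(index_name, corpus_path, memory="2560M",
                       store_docs=False, url_metadata=False):
    """Return an IndriBuildIndex parameter file (as a string) for one segment.

    Mirrors the English_1 file shown above; this helper is illustrative only.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "memory").text = memory
    ET.SubElement(root, "storeDocs").text = "true" if store_docs else "false"
    ET.SubElement(root, "index").text = index_name
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = "warc"
    if url_metadata:
        # Optional: store url as forward and backward metadata.
        metadata = ET.SubElement(root, "metadata")
        ET.SubElement(metadata, "forward").text = "url"
        ET.SubElement(metadata, "backward").text = "url"
    for name in ("title", "heading"):
        field = ET.SubElement(root, "field")
        ET.SubElement(field, "name").text = name
    stemmer = ET.SubElement(root, "stemmer")
    ET.SubElement(stemmer, "name").text = "krovetz"
    return ET.tostring(root, encoding="unicode")


def build_stopper_params(words):
    """Return a stopper parameter file (as a string) from a list of stopwords."""
    root = ET.Element("parameters")
    stopper = ET.SubElement(root, "stopper")
    for word in words:
        ET.SubElement(stopper, "word").text = word
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The path below is the one used for English_1 in this document.
    print(build_index_params("ClueWeb09_English_1",
                             "/mnt/nfs/clue1/ClueWeb09_English_1"))
    print(build_stopper_params(["a", "an"]))
```

Writing the two returned strings to index.param and stopwords.param gives files that can be passed to IndriBuildIndex as in the command above.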
Timings and Disk Usage
TREC B set:
0-24 ClueWeb09_English_1
2659:08: Documents parsed: 50220423 Documents indexed: 50220423
2659:08: Closing index
3986:16: Finished
159242.91user 54870.96system 66:26:18elapsed 89%CPU
0inputs+0outputs (2803839major+93043849minor)pagefaults 0swaps
[dfisher@compute-0-24 task_1]$ du --si
171G ./ClueWeb09_English_1/index/153
171G ./ClueWeb09_English_1/index
2.8G ./ClueWeb09_English_1/collection
174G ./ClueWeb09_English_1
English 1-10 (503903810 documents):
1499311.31user 830973.74system
647.3 CPU hours, ~719 hours @ 90% CPU == 30 days
778470 docs/hour, 700645 docs/hour @ 90% CPU
[dfisher@sydney ClueWeb09]$ du --si -sc ClueWeb09_English_*
174G ClueWeb09_English_1
165G ClueWeb09_English_2
163G ClueWeb09_English_3
164G ClueWeb09_English_4
159G ClueWeb09_English_5
160G ClueWeb09_English_6
164G ClueWeb09_English_7
153G ClueWeb09_English_8
157G ClueWeb09_English_9
119G ClueWeb09_English_10
1.6T total
Full ClueWeb09 (996026705 documents):
3461321.1user 1720054.56system
1439.27 CPU hours, ~1599 hours @ 90% CPU == 67 days
692166 docs/hour, 622906 docs/hour @ 90% CPU
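The CPU-hour and elapsed-time figures above follow from the user and system times reported by time. As a rough sketch of that arithmetic (the function name summarize is hypothetical, and the 90% CPU utilization is the figure assumed in the listings above):

```python
def summarize(user_seconds, system_seconds, cpu_utilization=0.9):
    """Convert time(1) user/system seconds into CPU hours and estimated wall time."""
    cpu_hours = (user_seconds + system_seconds) / 3600.0
    # At less than 100% CPU utilization the wall-clock time is proportionally longer.
    wall_hours = cpu_hours / cpu_utilization
    return {
        "cpu_hours": cpu_hours,
        "wall_hours": wall_hours,
        "wall_days": wall_hours / 24.0,
    }


if __name__ == "__main__":
    english = summarize(1499311.31, 830973.74)   # English 1-10
    full = summarize(3461321.1, 1720054.56)      # full ClueWeb09
    for label, result in (("English 1-10", english), ("Full ClueWeb09", full)):
        print(label,
              round(result["cpu_hours"], 2), "CPU hours,",
              round(result["wall_hours"]), "hours @ 90% CPU,",
              round(result["wall_days"]), "days")
```

These totals are summed over all segments; because the segments were indexed in parallel on separate machines, the calendar time for a build is much shorter than the day counts suggest.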
When selecting a value for the memory parameter, a good target value is roughly 60% of physical memory. This allows enough memory for the index writing thread to work while the indexing thread continues, without pushing the machine into swap. Increasing the memory parameter reduces the number of temporary indexes written to disk, affording some speedup to the indexing process as a whole. IndriBuildIndex will utilize two CPUs effectively; having additional CPUs in the machine will not afford any appreciable speedup (assuming an otherwise unladen machine).
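The 60% rule of thumb can be turned into a value for the memory parameter with a small calculation. This is only a sketch: the function name suggest_memory_param is hypothetical, and the "M" suffix follows the 2560M value used in the English_1 parameter file.

```python
def suggest_memory_param(physical_mb, fraction=0.6):
    """Return a value for the <memory> parameter, e.g. '2457M'.

    physical_mb is the machine's physical memory in megabytes;
    fraction is the share of it handed to IndriBuildIndex.
    """
    if physical_mb <= 0:
        raise ValueError("physical_mb must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return f"{int(physical_mb * fraction)}M"


if __name__ == "__main__":
    # A machine with 4 GB of physical memory, as in the cluster described above.
    print(suggest_memory_param(4096))
```

For the 4 GB machines used here this gives 2457M; the runs reported in this document used 2560M, which is slightly above the 60% target.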
The Other Languages
It is possible to index each of the other language segments with the parameters used for the English segments by removing the stemmer parameter. Some issues with using the resulting indexes effectively are still being worked on:
- The documents in the WARC files are in multiple encodings, e.g. some UTF-8, some GBK, etc. IndriBuildIndex is not able to perform encoding normalization, and expects UTF-8 input. Documents in other encodings will be indexed, but a UTF-8 encoded query will not find them.
- The unsegmented languages, e.g. Chinese and Japanese, will be indexed with word breaks occurring on token boundaries identified by UTF-8 encoded punctuation characters and white space. This is clearly not a sensible default for those languages. A single-character indexing FileClassEnvironment, warcchar, is currently under development. This will enable indexing each individual character as a separate word.
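Both issues can be illustrated with a few lines of Python. This is a sketch for illustration only and says nothing about how IndriBuildIndex or warcchar are implemented. The sample text and the function names are hypothetical.

```python
import re

# Hypothetical sample: two Chinese terms separated by a full-width comma.
text = "信息检索，系统"

# Encoding mismatch: the same characters produce different byte sequences,
# so a term indexed from GBK bytes cannot match a UTF-8 encoded query term.
utf8_bytes = "信息检索".encode("utf-8")
gbk_bytes = "信息检索".encode("gbk")


def punctuation_and_space_tokens(s):
    """Split on white space and punctuation: each unbroken run becomes one token."""
    return re.findall(r"\w+", s)


def single_character_tokens(s):
    """Treat every letter or digit character as its own token."""
    return [ch for ch in s if ch.isalnum()]


if __name__ == "__main__":
    print(len(utf8_bytes), len(gbk_bytes), utf8_bytes == gbk_bytes)
    print(punctuation_and_space_tokens(text))
    print(single_character_tokens(text))
```

With punctuation and white space as the only boundaries, a whole clause becomes a single "word", which is why per-character indexing is the more useful default for unsegmented text.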
Retrieval and Parameters
Initial retrieval runs using the first 100 queries from the TREC Efficiency Track (06) have been performed with the TREC-B collection and with the full English 1-10 set. The queries average 4 words and use no Indri query language operators. Retrieval for the full set used 10 IndriDaemon servers, one on each of the compute nodes that had built an English segment. Retrieval on the TREC-B set used the single Repository on an NFS-mounted partition. Both runs used a machine configured with 8 x 3 GHz CPUs and 8 GB of physical memory. This data provides an initial estimate of query throughput.
Timings
TREC-B (English_1)
1-thread:
1173.47user 41.33system 33:31.13elapsed 60%CPU
12.2 sec/query CPU, 20.6 sec/query wall clock
6-threads:
1806.96user 77.90system 5:15.34elapsed 597%CPU
18.9 sec/query CPU, 3.2 sec/query wall clock
The number of threads is specified via the threads parameter to IndriRunQuery, e.g.:
<parameters>
  <threads>6</threads>
</parameters>
English_1 - English_10
6-threads:
9.79user 18.41system 13:49.28elapsed 3%CPU
8.29 sec/query wall clock (CPU time not instrumented on 10 servers).
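The wall-clock figure per query is the elapsed time reported by time divided by the 100 queries in the run. A small sketch of that conversion follows; the function names are hypothetical, and the parser assumes the m:ss.ss and h:mm:ss elapsed formats seen in the output in this document.

```python
def elapsed_seconds(elapsed):
    """Convert an elapsed string such as '13:49.28' or '66:26:18' to seconds."""
    seconds = 0.0
    # Each colon-separated field is worth 60 times the field to its right.
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def seconds_per_query(elapsed, num_queries=100):
    """Average wall-clock seconds per query for a run of num_queries queries."""
    return elapsed_seconds(elapsed) / num_queries


if __name__ == "__main__":
    # English_1 - English_10, 6 threads, 10 indrid servers.
    print(round(seconds_per_query("13:49.28"), 2))
    # TREC-B, 6 threads.
    print(round(seconds_per_query("5:15.34"), 1))
```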
The run was performed using 10 indrid servers, one for each of the 10 English segments. The indrid parameters are:
<parameters>
  <index>/path/to/segment/index</index>
</parameters>
One indrid server was run on each of ten machines (compute-[1..10]).
The IndriRunQuery parameters include a server entry for each of the hosts.
<parameters>
  <server>compute-1</server>
  <server>compute-2</server>
  <server>compute-3</server>
  <server>compute-4</server>
  <server>compute-5</server>
  <server>compute-6</server>
  <server>compute-7</server>
  <server>compute-8</server>
  <server>compute-9</server>
  <server>compute-10</server>
</parameters>
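With many servers, the IndriRunQuery parameter file can be generated instead of typed. The sketch below is not part of the Lemur Toolkit; the function name build_run_query_params is hypothetical. Placing the threads parameter in the same file as the server entries is an assumption made here for convenience.

```python
import xml.etree.ElementTree as ET


def build_run_query_params(hosts, threads=None):
    """Return an IndriRunQuery parameter file (as a string) with one server per host."""
    root = ET.Element("parameters")
    if threads is not None:
        ET.SubElement(root, "threads").text = str(threads)
    for host in hosts:
        ET.SubElement(root, "server").text = host
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Host names follow the compute-[1..10] pattern used in this document.
    hosts = [f"compute-{i}" for i in range(1, 11)]
    print(build_run_query_params(hosts, threads=6))
```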