ClueWeb09 Information

The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

How to Get the Dataset

ClueWeb09 Dataset and Licensing Home Page : The official ClueWeb09 licensing information page

Dataset Details

Dataset Information : Information on the structure of the dataset on disk, the formatting of the data and extra information.

Sample Files : Sample files in various languages from the ClueWeb09 dataset

Language Identification : How language identification was performed on the dataset

Page Encodings : How the character encodings for the dataset are formatted

How to Use the Dataset

Working with WARC Files : Information on working with the WARC files in the ClueWeb09 dataset

Indexing with Indri and Lemur : Notes on indexing the ClueWeb09 Dataset using Indri and the Lemur Toolkit

Derived Data

Duplicate URLs (CMU): There are about 0.2% duplicate URLs in the Category A set.

PageRank (CMU): PageRank scores for both Categories A and B.

Redirects (CMU): Redirect information for the Category B dataset

Web Graph (CMU): Information on the web graph of nodes and outlinks for the dataset

Anchor Text (Twente): For most of the Category A dataset, provided by Djoerd Hiemstra.

Spam Rankings (Waterloo): For each page in the ClueWeb09 dataset, provided by Gord Cormack.

url -> docno mapping (NIST): ClueWeb09 document number mappings, provided by Ian Soboroff.

Anchor text query log (UMass): For the English subset of the Category A dataset.

Related Services

Search Category B: Use the Indri search engine to search the ClueWeb09 Category B dataset

Search Category A - English: Use the Indri search engine to search the English subset of the ClueWeb09 Category A dataset

Page Rendering Service: Render selected ClueWeb09 web pages (text + images).

Attribute Lookup Service: Fast lookup of selected ClueWeb09 document attributes.

More information about using the CGI Web interface to display document contents, etc.

Staying Informed

ClueWeb09 Mailing List : The ClueWeb09 mailing list

Acknowledgements

The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.