ClueWeb09 Information
How to Get the Dataset
- ClueWeb09 Dataset and Licensing Home Page : The official ClueWeb09 licensing information page
Dataset Details
- Dataset Information : Information on the structure of the dataset on disk, the formatting of the data and extra information.
- Sample Files : Sample files in various languages from the ClueWeb09 dataset
- Language Identification : How language identification was performed on the dataset
- Page Encodings : How the character encodings for the dataset are formatted
How to Use the Dataset
- Working with WARC Files : Information on working with the WARC files in the ClueWeb09 dataset
- Indexing with Indri and Lemur : Notes on indexing the ClueWeb09 Dataset using Indri and the Lemur Toolkit
Derived Data
- Duplicate URLs (CMU): About 0.2% of the URLs in the Category A set are duplicates.
- PageRank (CMU): PageRank scores for both Categories A and B.
- Redirects (CMU): Redirect information for the Category B dataset
- Web Graph (CMU): Information on the web graph of nodes and outlinks for the dataset
- Anchor Text (Twente): For most of the Category A dataset, provided by Djoerd Hiemstra.
- Spam Rankings (Waterloo): For each page in the ClueWeb09 dataset, provided by Gord Cormack.
- url -> docno mapping (NIST): ClueWeb09 document number mappings, provided by Ian Soboroff.
- Anchor text query log (UMass): For the English subset of the Category A dataset.
Related Services
- Search Category B: Use the Indri search engine to search the ClueWeb09 Category B dataset
- Search Category A - English: Use the Indri search engine to search the English subset of the ClueWeb09 Category A dataset
- Page Rendering Service: Render selected ClueWeb09 web pages (text + images).
- Attribute Lookup Service: Fast lookup of selected ClueWeb09 document attributes.
- More information about using the CGI Web interface to display document contents etc.
Staying Informed
- ClueWeb09 Mailing List : The ClueWeb09 mailing list
Acknowledgements
The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.