ClueWeb09 Related Data
The resources listed below may be useful when working with the ClueWeb09 dataset. Some were produced by the Lemur Project, and some were produced by other organizations.
Derived Data Produced by the Lemur Project
The resources listed below were created by analyzing the ClueWeb09 corpus.
- Anchor text query log: For the English subset of the Category A dataset.
- Duplicate URLs: There are about 0.2% duplicate URLs in the Category A set.
- PageRank: PageRank scores for both Categories A and B.
- Redirects: Redirect information for the Category B dataset.
- Web Graph: Information on the web graph of nodes and outlinks for the dataset.
Derived Data Produced by Other Research Groups
The resources listed below were created by analyzing the ClueWeb09 corpus.
- Anchor Text (Twente University): For most of the Category A dataset, provided by Djoerd Hiemstra.
- Sketch Engine (Masaryk University): A searchable copy of ClueWeb09 that has been cleaned, deduplicated at the paragraph level, POS-tagged, and lemmatised (with TreeTagger).
- Spam Rankings (University of Waterloo): For each page in the ClueWeb09 dataset, provided by Gord Cormack.
- URL -> docid mapping (NIST): ClueWeb09 document ID mappings, provided by Ian Soboroff.
- TREC 2011 Crowdsourcing track (University of Waterloo): ClueWeb09 subset used for the TREC 2011 Crowdsourcing track, prepared by Mark Smucker.
- APRoPAT (University of Petra): Arabic Associative Root-Pattern Data. About 11.5 billion word forms and 9.3 million associative relationships created by analyzing Arabic pages in ClueWeb09. Provided by Bassam Haddad.
Related Data Produced by Other Research Groups
- FACC1 Freebase Annotations (Google, Inc.): Freebase entity annotations for the ClueWeb09 dataset.
- TREC Freebase Queries (Google, Inc.): Freebase annotations for TREC Million Query Track and Web Track queries.