ClueWeb09 Derived Data
- Duplicate URLs (CMU): About 0.2% of the URLs in the Category A set are duplicates.
- PageRank (CMU): PageRank scores for both Categories A and B.
- Redirects (CMU): Redirect information for the Category B dataset.
- Web Graph (CMU): Information on the web graph of nodes and outlinks for the dataset.
- Anchor Text (Twente): For most of the Category A dataset, provided by Djoerd Hiemstra.
- Spam Rankings (Waterloo): For each page in the ClueWeb09 dataset, provided by Gord Cormack.
- URL -> docid mapping (NIST): ClueWeb09 document ID mappings, provided by Ian Soboroff.
- Anchor text query log (UMass): For the English subset of the Category A dataset.



