ClueWeb12 Related Data
The resources listed below may be useful when working with the ClueWeb12 dataset. Some were produced by the Lemur Project; others were produced by other organizations.
Derived Data Produced by the Lemur Project
The resources listed below were created by analyzing the ClueWeb12 corpus.
- PageRank: PageRank scores for ClueWeb12 (Full dataset and B13).
- Redirects: Redirect Information for the ClueWeb12 dataset.
- WebGraph: The full webgraph for the ClueWeb12 dataset.
- ClueWeb12_All_edocid2url.txt.bz2 (11G): Mapping of external document ids to URLs for the ClueWeb12 dataset (733,019,372 documents). Each line represents one document in the format: <edocid> <document url> (delimited by one space).
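The line format of the id-to-URL mapping described above can be read with a short script. The following is a minimal sketch, not an official tool: the function names `parse_mapping` and `read_mapping` are hypothetical, the UTF-8 decoding with replacement of invalid bytes is an assumption about the file's encoding, and the document id in the usage note is only illustrative. Each line is split on the first space only, in case a URL itself contains a space.

```python
import bz2
from typing import Iterable, Iterator, Tuple


def parse_mapping(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (edocid, url) pairs from lines of the form '<edocid> <document url>'."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first space only; everything after it is the URL.
        edocid, _, url = line.partition(" ")
        yield edocid, url


def read_mapping(path: str) -> Iterator[Tuple[str, str]]:
    """Stream pairs from the bz2-compressed mapping file without unpacking it to disk."""
    # Assumption: the text is UTF-8; undecodable bytes are replaced, not fatal.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_mapping(handle)
```

For example, `for edocid, url in read_mapping("ClueWeb12_All_edocid2url.txt.bz2"): ...` processes one document at a time. Streaming matters here because the file holds over 700 million lines, so loading all pairs into an in-memory dictionary is unlikely to be practical on a typical machine.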
Derived Data Produced by Other Research Groups
The resources listed below were created by analyzing the ClueWeb12 corpus.
- Anchor text (Twente Univ): Provided by Djoerd Hiemstra.
- TREC 2013 Contextual Suggestions Track Collection (Waterloo, CSIRO, Amsterdam): Documents used in the TREC 2013 Contextual Suggestions Track (password required).
- Spam Rankings (Waterloo Univ): Spam rankings for the ClueWeb12 dataset, provided by Mark Smucker.
- LDA topic models (1000 topics) on the full ClueWeb12 dataset, provided by Carsten Eickhoff.
Related Data Produced by Other Research Groups
- FACC1 Freebase Annotations (Google Inc): Freebase entity annotations for the ClueWeb12 dataset.