ClueWeb12 Related Data

The resources listed below may be useful when working with the ClueWeb12 dataset. Some were produced by the Lemur Project, and some were produced by other organizations.

Derived Data Produced by the Lemur Project

The resources listed below were created by analyzing the ClueWeb12 corpus.

PageRank: PageRank scores for ClueWeb12 (Full dataset and B13).
Redirects: Redirect Information for the ClueWeb12 dataset.
WebGraph: The full webgraph for the ClueWeb12 dataset.
ClueWeb12_All_edocid2url.txt.bz2: (11G) Mapping of External Document Ids to URLs for the full ClueWeb12 dataset (733,019,372 documents). Each line represents one document in the format: <edocid> <document url> (delimited by one space).
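
The line format described above for ClueWeb12_All_edocid2url.txt.bz2 can be read without decompressing the whole file to disk. The following is a minimal sketch, not an official Lemur Project tool. The function names parse_mapping_line and iter_edocid_urls are hypothetical. The UTF-8 decoding and the choice to split only at the first space (so that any later spaces stay in the URL) are assumptions that the file description does not confirm.

```python
import bz2
from typing import Iterator, Tuple


def parse_mapping_line(line: str) -> Tuple[str, str]:
    """Split one mapping line into (edocid, url) at the first space."""
    # Assumption: only the first space is the delimiter; any later
    # spaces are treated as part of the URL.
    edocid, sep, url = line.rstrip("\r\n").partition(" ")
    if not sep or not edocid or not url:
        raise ValueError(f"malformed mapping line: {line!r}")
    return edocid, url


def iter_edocid_urls(path: str) -> Iterator[Tuple[str, str]]:
    """Stream (edocid, url) pairs from the bz2-compressed mapping file."""
    # Assumption: the text is UTF-8; undecodable bytes are replaced
    # rather than aborting a pass over hundreds of millions of lines.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_mapping_line(line)
```

Because iter_edocid_urls is a generator, a caller can loop over it and stop early, or filter for a few document ids, without holding all 733,019,372 entries in memory.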

Derived Data Produced by Other Research Groups

The resources listed below were created by analyzing the ClueWeb12 corpus.

Anchor text (Twente Univ): Provided by Djoerd Hiemstra.
TREC 2013 Contextual Suggestions Track Collection (Waterloo, CSIRO, Amsterdam): Documents used in the TREC 2013 Contextual Suggestions Track (password required).
Spam Rankings (Waterloo Univ): Spam rankings for the ClueWeb12 dataset, provided by Mark Smucker.
LDA Topic Models: LDA topic models (1000 topics) on the full ClueWeb12 dataset, provided by Carsten Eickhoff.

Related Data Produced by Other Research Groups

FACC1 Freebase Annotations (Google Inc): Freebase entity annotations for the ClueWeb12 dataset.