Anchor Text Query Log for ClueWeb09


Using the techniques described in [Dang & Croft 10], a simulated query log has been constructed from the anchor text of the ~500 million English documents in the ClueWeb09 collection (ClueWeb09_English_{1-10}). Two versions are available for download.

Dataset Description and Preprocessing

Anchor Text Query Log Input Data

All URL and anchor text pairs in the ClueWeb09 English subset were collected.

The following processing steps were applied:

Anchor Text Query Logs

The whole anchor text collection (244G, not distributed online) was used to generate the two query log versions.

The downloadable versions are distributed as a compressed tar archive containing a number of compressed files. Each file contains multiple records, one record per line. Each record has 3 TAB-separated components:

  1. Link To (URL)
  2. Anchor Text
  3. Frequency of the tuple <Link To, Anchor Text>
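
As an illustration of the record layout above, the following is a minimal Python sketch for parsing decompressed lines into (URL, anchor text, frequency) tuples. The names AnchorRecord and parse_records are hypothetical and are not part of the distributed code. The sketch assumes that the frequency field is a plain integer and that any line without exactly 3 TAB-separated fields should be skipped; neither assumption is stated by the dataset description.

```python
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One record of the anchor text query log (hypothetical helper type)."""
    url: str
    anchor_text: str
    frequency: int


def parse_records(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Yield one AnchorRecord per well-formed line.

    Assumption: lines that do not have exactly 3 TAB-separated fields,
    or whose third field is not an integer, are silently skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        url, anchor_text, raw_frequency = parts
        try:
            frequency = int(raw_frequency)
        except ValueError:
            continue
        yield AnchorRecord(url, anchor_text, frequency)
```

Because parse_records accepts any iterable of strings, it can be fed an open text file handle directly, so a file is processed one line at a time rather than loaded into memory.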

Downloadable Data

Clue-500M-Anchor-Log-External.tar.gz (11G) checksum: Clue-500M-Anchor-Log-External.tar.gz.md5 (4K)
Only URL and anchor text pairs associated with external links are included. Internal links, e.g. "<a href="#name">name</a>", are excluded.
Clue-500M-Anchor-Log-All.tar.gz (64G) checksum: Clue-500M-Anchor-Log-All.tar.gz.md5 (4K)
All URL and anchor text pairs are kept, including those from internal links.
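
Given the size of these archives, it may be worth checking a download against its .md5 file before unpacking. The sketch below is one way to do so in Python. It assumes the .md5 file follows the common md5sum layout, in which the first whitespace-separated token is the hexadecimal digest; the actual layout of the distributed checksum files is not described here. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path: str) -> str:
    """Read the expected digest from a checksum file.

    Assumption: the first whitespace-separated token is the hex digest,
    as in the output of the md5sum utility.
    """
    with open(md5_path, "r", encoding="ascii") as handle:
        return handle.read().split()[0].lower()


def verify_download(archive_path: str, md5_path: str) -> bool:
    """Return True if the archive's MD5 digest matches the checksum file."""
    return md5_of_file(archive_path) == expected_md5(md5_path)
```

For example, verify_download("Clue-500M-Anchor-Log-External.tar.gz", "Clue-500M-Anchor-Log-External.tar.gz.md5") would return True only when the two digests agree, under the layout assumption above.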

Downloadable Code

TF-Build-Anchor-Log.tar.gz (2M) checksum: TF-Build-Anchor-Log.tar.gz.md5 (4K)
The Galago program used to generate the simulated query log from the anchor text collection.


[Dang & Croft 10] V. Dang and W.B. Croft. Query Reformulation Using Anchor Text. In Proc. of WSDM, pages 41-50, 2010.