Anchor Text Query Log for ClueWeb09


Overview


Using the techniques described in [Dang & Croft 10], a simulated query log was constructed from the anchor text of the ~500 million English documents in the ClueWeb09 collection (ClueWeb09_English_{1-10}). Two versions are available for download.


Dataset Description and Preprocessing


Anchor Text Query Log Input Data

All URL and anchor text pairs in the ClueWeb09 English subset were collected.

The following processing steps were applied:

Anchor Text Query Logs

The whole anchor text collection (244G, not distributed online) was used to generate the two query log versions.

The downloadable versions are distributed as a compressed tar archive containing a number of compressed files. Each file contains multiple records, one record per line. Each record has 3 TAB-separated components:

  1. Link To (URL)
  2. Anchor Text
  3. Frequency of the tuple <Link To, Anchor Text>
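
The record layout above can be read with a few lines of Python. This is a minimal sketch, not part of the distributed code: the names AnchorRecord, parse_record and anchor_text_counts are hypothetical, and it assumes the inner files have already been decompressed to plain text with exactly three TAB-separated fields per line.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class AnchorRecord(NamedTuple):
    """One line of the query log (hypothetical helper type)."""
    url: str          # 1. Link To (URL)
    anchor_text: str  # 2. Anchor Text
    frequency: int    # 3. Frequency of the tuple <Link To, Anchor Text>


def parse_record(line: str) -> AnchorRecord:
    """Split one TAB-separated line into its three components."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 TAB-separated fields, got {len(fields)}")
    url, anchor_text, frequency = fields
    return AnchorRecord(url, anchor_text, int(frequency))


def anchor_text_counts(lines: Iterable[str]) -> Counter:
    """Sum the frequencies of each anchor text over all target URLs."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_record(line)
        counts[record.anchor_text] += record.frequency
    return counts
```

Passing an open file object to anchor_text_counts processes the log one line at a time, which matters for files of this size.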

Downloadable Data


Clue-500M-Anchor-Log-External.tar.gz (11G) checksum: Clue-500M-Anchor-Log-External.tar.gz.md5 (4K)
Only URL and anchor text pairs associated with external links are included. Internal links, e.g. "<a href="#name">name</a>", are excluded.
Clue-500M-Anchor-Log-All.tar.gz (64G) checksum: Clue-500M-Anchor-Log-All.tar.gz.md5 (4K)
All URL and anchor text pairs are kept.
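
Given the size of these archives, it may be worth checking a download against its .md5 file before unpacking. The sketch below is one way to do that in Python. It assumes the .md5 file follows the common md5sum layout, with the hexadecimal digest as the first whitespace-separated token; the actual layout of the distributed checksum files is not documented here. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Read the digest from a checksum file (assumed md5sum layout)."""
    text = Path(md5_path).read_text().strip()
    return text.split()[0].lower()


def verify(archive_path, md5_path):
    """Return True if the archive matches the digest in the .md5 file."""
    return md5_of_file(archive_path) == expected_md5(md5_path)
```

For example, verify("Clue-500M-Anchor-Log-External.tar.gz", "Clue-500M-Anchor-Log-External.tar.gz.md5") returns True only when the two digests agree.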

Downloadable Code


TF-Build-Anchor-Log.tar.gz (2M) checksum: TF-Build-Anchor-Log.tar.gz.md5 (4K)
The Galago program that generates the simulated query log from the anchor text collection.

References


V. Dang and W.B. Croft. Query Reformulation Using Anchor Text. In Proc. of WSDM, pages 41-50, 2010.