Anchor Text Query Log for ClueWeb09

Overview

Using the techniques described in [Dang & Croft 10], a simulated query log was constructed from the anchor text of the ~500 million English documents in the ClueWeb09 collection (ClueWeb09_English_{1-10}). Two versions are available for download.

Dataset Description and Preprocessing

Anchor Text Query Log Input Data

All URL and anchor text pairs in the ClueWeb09 English subset were collected.

The following processing steps were applied:

All invalid URLs (including URLs that contain tab characters) were removed.
All non-standard characters, including control characters and other unreadable characters, were removed from the anchor text values.
Anchor texts that contained only numbers were removed.
All anchor text strings that appear in their entirety in the list TF-Build-Anchor-Log/data/clue-stop-anchor-whole.txt (checksum: TF-Build-Anchor-Log/data/clue-stop-anchor-whole.txt.md5) were discarded.
All anchor text strings that contain any of the words in the list TF-Build-Anchor-Log/data/clue-stop-anchor-contain.txt (checksum: TF-Build-Anchor-Log/data/clue-stop-anchor-contain.txt.md5) were discarded.
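
The processing steps above can be sketched in Python as follows. This is an illustration only, not the distributed Galago code: the function names, the tiny stand-in stop lists, the case-insensitive matching, the replacement of control characters by a space, and the reduction of URL validation to the tab rule are all assumptions made for the example.

```python
import re

# Hypothetical stand-ins for the two stop lists. The real lists are the files
# clue-stop-anchor-whole.txt and clue-stop-anchor-contain.txt.
STOP_WHOLE = {"click here", "home"}
STOP_CONTAIN = {"login"}

# ASCII control characters; the source does not define "non-standard" precisely.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def clean_anchor(text):
    """Replace control characters with a space, then collapse whitespace."""
    return " ".join(_CONTROL_CHARS.sub(" ", text).split())


def keep_pair(url, anchor, stop_whole=STOP_WHOLE, stop_contain=STOP_CONTAIN):
    """Return the cleaned anchor text if the pair survives filtering, else None."""
    # Assumption: only the empty-URL and tab rules are checked here.
    if not url or "\t" in url:
        return None
    anchor = clean_anchor(anchor)
    if not anchor:
        # Assumption: anchors that are empty after cleaning are dropped.
        return None
    if anchor.replace(" ", "").isdigit():
        # Anchor text made up of numbers only.
        return None
    lowered = anchor.lower()  # assumption: matching ignores case
    if lowered in stop_whole:
        return None
    if any(word in stop_contain for word in lowered.split()):
        return None
    return anchor
```

With these stand-in lists, keep_pair("http://example.com/", "Click Here") returns None, while an anchor such as "homepage design" is kept, because "home" is matched only against the whole string.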

Anchor Text Query Logs

The whole anchor text collection (244G, not distributed online) was used to generate the two query log versions.

The downloadable versions are distributed as a compressed tar archive containing a number of compressed files. Each file contains multiple records, one record per line. Each record has 3 TAB-separated components:

Link To (URL)
Anchor Text
Frequency of the tuple <Link To, Anchor Text>
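
A minimal sketch of reading records in this format with Python is shown below. The names AnchorRecord and parse_records are hypothetical, and skipping malformed lines is an assumption of the example rather than documented behavior of the data.

```python
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    url: str          # Link To (URL)
    anchor_text: str  # Anchor Text
    frequency: int    # Frequency of the tuple <Link To, Anchor Text>


def parse_records(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Yield one AnchorRecord per well-formed TAB-separated line."""
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 3:
            continue  # assumption: silently skip malformed lines
        url, anchor, freq = parts
        try:
            count = int(freq)
        except ValueError:
            continue  # assumption: skip lines with a non-integer frequency
        yield AnchorRecord(url, anchor, count)
```

The function accepts any iterable of text lines, so it can be fed an open file object once an individual file has been extracted and decompressed. The compression format of the files inside the archive is not stated here, so the decompression step is left to the reader.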

Downloadable Data

Clue-500M-Anchor-Log-External.tar.gz (11G; checksum: Clue-500M-Anchor-Log-External.tar.gz.md5, 4K): Only <URL, anchor text> pairs associated with external links are included. Internal links, e.g. <a href="#name">name</a>, are excluded.
Clue-500M-Anchor-Log-All.tar.gz (64G; checksum: Clue-500M-Anchor-Log-All.tar.gz.md5, 4K): All <URL, anchor text> pairs are kept.
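
Given the size of the archives, it is worth verifying a download against its .md5 file before unpacking. The sketch below assumes that the .md5 file follows the common md5sum layout, with the hexadecimal digest as the first whitespace-separated token; that layout is an assumption, and the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(archive_path, md5_path):
    """Compare a file's MD5 digest with the first token of an .md5 file."""
    with open(md5_path, "r", encoding="ascii", errors="replace") as f:
        tokens = f.read().split()
    if not tokens:
        return False  # empty checksum file: nothing to compare against
    # Assumption: the first token is the expected hex digest (md5sum layout).
    return md5_of_file(archive_path) == tokens[0].lower()
```

For example, matches_checksum("Clue-500M-Anchor-Log-External.tar.gz", "Clue-500M-Anchor-Log-External.tar.gz.md5") should return True for an intact download if the checksum file has the assumed layout.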

Downloadable Code

TF-Build-Anchor-Log.tar.gz (2M; checksum: TF-Build-Anchor-Log.tar.gz.md5, 4K): The Galago program that generates the simulated query log from the anchor text collection.

References

V. Dang and W.B. Croft. Query Reformulation Using Anchor Text. In Proc. of WSDM, pages 41-50, 2010.