Anchor Text Query Log for ClueWeb09
Overview
Using the techniques described in [Dang & Croft 10], a simulated query log was constructed from the anchor text of the ~500 million English documents in the ClueWeb09 collection (ClueWeb09_English_{1-10}). Two versions are available for download.
Dataset Description and Preprocessing
Anchor Text Query Log Input Data
All URL and anchor text pairs in the ClueWeb09 English subset were collected.
The following processing steps were applied:
- All invalid URLs (including URLs that contain tab characters) were removed.
- All non-standard characters, including control characters and other unreadable characters, were removed from the anchor text values.
- Anchor texts that contained only numbers were removed.
- All anchor text strings that appear in their entirety in the list TF-Build-Anchor-Log/data/clue-stop-anchor-whole.txt (checksum: TF-Build-Anchor-Log/data/clue-stop-anchor-whole.txt.md5) were discarded.
- All anchor text strings that contain any of the words in the list TF-Build-Anchor-Log/data/clue-stop-anchor-contain.txt (checksum: TF-Build-Anchor-Log/data/clue-stop-anchor-contain.txt.md5) were discarded.
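The processing steps above can be sketched in Python as follows. This is only an illustration, not the Galago code shipped in TF-Build-Anchor-Log. The function names (is_valid_url, clean_anchor, keep_pair) are hypothetical, and several details are assumptions: that a valid URL has a scheme and a host, that "non-standard characters" means control characters, that the stop lists are matched case-insensitively, and that the "contain" list is checked against whitespace-separated words.

```python
import re
from urllib.parse import urlsplit

# Assumption: "non-standard characters" are the C0/C1 control characters.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def is_valid_url(url):
    """Assumed validity check: no tab characters, plus a scheme and a host."""
    if "\t" in url:
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return bool(parts.scheme) and bool(parts.netloc)


def clean_anchor(text):
    """Replace control characters with spaces, then collapse whitespace."""
    text = _CONTROL_CHARS.sub(" ", text)
    return " ".join(text.split())


def keep_pair(url, anchor, stop_whole, stop_contain):
    """Return the cleaned anchor text if the pair survives, else None."""
    if not is_valid_url(url):
        return None
    anchor = clean_anchor(anchor)
    # Drop empty anchors and anchors made up only of numbers.
    if not anchor or all(token.isdigit() for token in anchor.split()):
        return None
    lowered = anchor.lower()
    # Whole-string stop list (assumed case-insensitive).
    if lowered in stop_whole:
        return None
    # Word-level stop list (assumed to match whitespace-separated words).
    if any(word in stop_contain for word in lowered.split()):
        return None
    return anchor
```

Here stop_whole and stop_contain are plain Python sets of lowercased entries; how the two stop list files are laid out on disk is not stated above, so loading them is left out of the sketch.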
Anchor Text Query Logs
The whole anchor text collection (244G, not distributed online) was used to generate the two query log versions.
Each downloadable version is distributed as a compressed tar archive containing a number of compressed files. Each file contains multiple records, one record per line. Each record has three TAB-separated components:
- Link To (URL)
- Anchor Text
- Frequency of the tuple <Link To, Anchor Text>
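A minimal sketch of reading records in this format, assuming the lines have already been decompressed and decoded to text (the compression format of the inner files is not stated here). The names AnchorRecord, parse_records, and anchor_totals are hypothetical, and skipping malformed lines is a choice made for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    url: str        # Link To (URL)
    anchor: str     # Anchor Text
    frequency: int  # Frequency of the <Link To, Anchor Text> tuple


def parse_records(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Yield one AnchorRecord per well-formed TAB-separated line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            continue  # not exactly three components: skip
        url, anchor, freq = fields
        try:
            count = int(freq)
        except ValueError:
            continue  # frequency is not an integer: skip
        yield AnchorRecord(url, anchor, count)


def anchor_totals(records: Iterable[AnchorRecord]) -> Counter:
    """Sum frequencies per anchor text across all target URLs."""
    totals = Counter()
    for record in records:
        totals[record.anchor] += record.frequency
    return totals
```

Because parse_records accepts any iterable of strings, it can be fed an open text file handle directly, which keeps memory use flat on files of this size.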
Downloadable Data
- Clue-500M-Anchor-Log-External.tar.gz (11G) checksum: Clue-500M-Anchor-Log-External.tar.gz.md5 (4K)
  - Only URL and anchor text pairs associated with external links are included. Internal links, e.g., "<a href="#name">name</a>", are excluded.
- Clue-500M-Anchor-Log-All.tar.gz (64G) checksum: Clue-500M-Anchor-Log-All.tar.gz.md5 (4K)
  - All URL and anchor text pairs are kept.
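After downloading, an archive can be checked against its .md5 file. The sketch below uses Python's standard hashlib module and reads the archive in chunks, since the files are tens of gigabytes. The function names are hypothetical, and the layout of the .md5 files is an assumption: the code expects the hex digest to be the first whitespace-separated token, as in md5sum output.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_md5(md5_path):
    """Assumption: the digest is the first token in the .md5 file."""
    with open(md5_path, "r", encoding="ascii", errors="replace") as handle:
        tokens = handle.read().split()
    if not tokens:
        raise ValueError("empty checksum file: " + md5_path)
    return tokens[0].lower()


def verify(archive_path, md5_path):
    """Return True if the archive's MD5 matches the published digest."""
    return md5_of_file(archive_path) == read_expected_md5(md5_path)
```

For example, verify("Clue-500M-Anchor-Log-External.tar.gz", "Clue-500M-Anchor-Log-External.tar.gz.md5") returns True only when the two digests agree.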
Downloadable Code
- TF-Build-Anchor-Log.tar.gz (2M) checksum: TF-Build-Anchor-Log.tar.gz.md5 (4K)
  - The Galago program used to generate the simulated query log from the anchor text collection.
References
V. Dang and W.B. Croft. Query Reformulation Using Anchor Text. In Proc. of WSDM, pages 41-50, 2010.