ClueWeb12 Related Data:
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)
Researchers at Google annotated English-language Web pages from the ClueWeb09 and ClueWeb12 corpora. The annotation process was automatic, and hence imperfect. However, the annotations are of generally high quality, as they strove for high precision (and, by necessity, lower recall). For each entity they recognized with high confidence, they provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and two confidence levels (computed differently, see below).
You might consider using this data in conjunction with the recently released Freebase annotations of several TREC query sets.
Data Description
The annotations for each corpus are provided as a collection of 500 files (the partition into individual files is somewhat arbitrary). Each file contains annotations of multiple web pages; each document name is followed by the list of entities identified in that page.
Here is an excerpt from an annotation file:
clueweb09-en0000-00-04720.html
PDF | 21089 | 21092 | 0.99763662 | 6.6723776e-05 | /m/0600q |
FDA | 21303 | 21306 | 0.9998256 | 0.00057182228 | /m/032mx |
Food and Drug Administration | 21312 | 21340 | 0.9998256 | 0.00057182228 | /m/032mx |
In this example,
- "clueweb09-en0000-00-04720.html" is the name of the document that was annotated
- "PDF" is the entity mention in text
- 21089 and 21092 are the beginning and end byte offsets of the entity mention in the input text. The zero (0) location used for calculating the annotation offsets is the beginning of the HTTP headers. This is the first byte after the WARC document header.
- 0.99763662 is the posterior of an entity given both the mention and the context (of the mention)
- 6.6723776e-05 is the posterior given just the context of the mention (ignoring the mention string itself)
- /m/0600q is the Freebase identifier (mid) of the entity. To look up the entity in Freebase, prepend the string "http://www.freebase.com" to the identifier, like so: "http://www.freebase.com/m/0600q".
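The field layout described above can be read with a small parser. The following is a minimal sketch, not part of the release: the names EntityAnnotation, splitFields, parseEntityLine and freebaseUrl are hypothetical, the field separator is assumed to be a tab character, and the end offset is treated as exclusive (inferred from the example, where 21092 - 21089 equals the three bytes of "PDF").

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One entity annotation, mirroring the fields described in the list above.
struct EntityAnnotation {
    std::string mention;       // entity mention as it appears in the text
    std::size_t begin;         // byte offset of the first byte of the mention
    std::size_t end;           // byte offset one past the last byte (assumed)
    double mentionPosterior;   // posterior given the mention and its context
    double contextPosterior;   // posterior given the context only
    std::string mid;           // Freebase identifier, e.g. /m/0600q
};

// Split a line on a single-character delimiter (assumed to be a tab).
std::vector<std::string> splitFields(const std::string& line, char delim = '\t') {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, delim)) {
        fields.push_back(field);
    }
    return fields;
}

// Parse one entity line: mention, begin, end, two posteriors, mid.
EntityAnnotation parseEntityLine(const std::string& line) {
    std::vector<std::string> f = splitFields(line);
    if (f.size() < 6) {
        throw std::runtime_error("expected 6 fields: " + line);
    }
    EntityAnnotation a;
    a.mention = f[0];
    a.begin = std::stoul(f[1]);
    a.end = std::stoul(f[2]);
    a.mentionPosterior = std::stod(f[3]);
    a.contextPosterior = std::stod(f[4]);
    a.mid = f[5];
    return a;
}

// Build the lookup URL by prepending the Freebase host to the mid.
std::string freebaseUrl(const EntityAnnotation& a) {
    return "http://www.freebase.com" + a.mid;
}
```

Calling parseEntityLine on the first entity line of the excerpt and passing the result to freebaseUrl would give the URL shown in the last bullet. If the files use a different separator, only splitFields needs to change.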
Some documents do not have any annotations (and are not included in the annotation files) because no Freebase entity was recognized in them with high confidence.
There are 340,451,982 documents in ClueWeb09 and 456,498,584 documents in ClueWeb12 with at least one entity annotated. On average, ClueWeb09 documents have 15 entity mentions annotated, and ClueWeb12 documents have 13 mentions annotated. Additional statistics are available in the companion files "ClueWeb09_stats.txt" and "ClueWeb12_stats.txt".
Due to the sheer size of the data, it was not possible to verify all the automatic annotations manually. Based on a small-scale human evaluation, the precision (at the currently chosen threshold) is believed to be around 80-85%. Estimating the recall is of course difficult; however, it is believed to be around 70-85%.
Processed Data
The following files contain the annotations provided by Google, Inc. and processed by The Lemur Project. The processed files are text files with the extension .anns.tsv. The annotations have been converted to a format that we believe is easier to process. Each .anns.tsv file contains the annotations for the documents in the corresponding warc.gz file. For instance, Disk1/ClueWeb12_00/0000tw/0000tw.anns.tsv contains the annotations for the documents contained in the corpus file Disk1/ClueWeb12_00/0000tw/0000tw.warc.gz. The file format adds two columns to the beginning of each line:
- Document identifier (WARC-TREC-ID)
- Original encoding: The name of the encoding used to process the entry. This encoding is used to calculate the beginning and end byte offsets of the entity in the corpus document.
clueweb12-0000tw-00-00000 | UTF-8 | Flash Player | 12770 | 12782 | 1.000000 | 0.024096 | /m/05qh6g |
clueweb12-0000tw-00-00001 | UTF-8 | Flash Player | 12815 | 12827 | 1.000000 | 0.024096 | /m/05qh6g |
clueweb12-0000tw-00-00002 | UTF-8 | Flash Player | 12848 | 12860 | 1.000000 | 0.024096 | /m/05qh6g |
clueweb12-0000tw-00-00003 | UTF-8 | Flash Player | 12912 | 12924 | 1.000000 | 0.024096 | /m/05qh6g |
clueweb12-0000tw-00-00004 | UTF-8 | Flash Player | 12896 | 12908 | 1.000000 | 0.024096 | /m/05qh6g |
clueweb12-0000tw-00-00007 | UTF-8 | light of the world | 13081 | 13099 | 0.969305 | 0.000015 | /m/0gjf48y |
clueweb12-0000tw-00-00008 | UTF-8 | God | 9099 | 9102 | 0.809834 | 0.000100 | /m/0d05l6 |
clueweb12-0000tw-00-00008 | UTF-8 | God | 12217 | 12220 | 0.809834 | 0.000100 | /m/0d05l6 |
clueweb12-0000tw-00-00008 | UTF-8 | God | 12461 | 12464 | 0.809834 | 0.000100 | /m/0d05l6 |
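The processed rows shown above can be loaded and grouped per document. This is a minimal sketch under stated assumptions: the columns are assumed to be separated by tab characters (as the .tsv extension suggests), and the names ProcessedAnnotation, parseProcessedRow and readAnnotations are hypothetical. The minPosterior parameter is an illustrative filter on the first confidence value, not something the release prescribes.

```cpp
#include <exception>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One row of a processed .anns.tsv file (eight columns).
struct ProcessedAnnotation {
    std::string docId;        // WARC-TREC-ID
    std::string encoding;     // encoding the byte offsets refer to
    std::string mention;
    std::size_t begin;
    std::size_t end;
    double mentionPosterior;  // posterior given mention and context
    double contextPosterior;  // posterior given context only
    std::string mid;          // Freebase identifier
};

// Parse one row; returns false if the row is malformed.
bool parseProcessedRow(std::string line, ProcessedAnnotation& out) {
    // Tolerate Windows line endings.
    if (!line.empty() && line[line.size() - 1] == '\r') {
        line.erase(line.size() - 1);
    }
    std::vector<std::string> f;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) {
        f.push_back(field);
    }
    if (f.size() < 8) {
        return false;
    }
    try {
        out.begin = std::stoul(f[3]);
        out.end = std::stoul(f[4]);
        out.mentionPosterior = std::stod(f[5]);
        out.contextPosterior = std::stod(f[6]);
    } catch (const std::exception&) {
        return false;  // a numeric column could not be parsed
    }
    out.docId = f[0];
    out.encoding = f[1];
    out.mention = f[2];
    out.mid = f[7];
    return true;
}

// Read every row from a stream and group the annotations by document id,
// keeping only rows whose first posterior is at least minPosterior.
std::map<std::string, std::vector<ProcessedAnnotation> >
readAnnotations(std::istream& in, double minPosterior = 0.0) {
    std::map<std::string, std::vector<ProcessedAnnotation> > byDoc;
    std::string line;
    while (std::getline(in, line)) {
        ProcessedAnnotation a;
        if (parseProcessedRow(line, a) && a.mentionPosterior >= minPosterior) {
            byDoc[a.docId].push_back(a);
        }
    }
    return byDoc;
}
```

Passing an std::ifstream opened on an extracted .anns.tsv file to readAnnotations gives a map from document identifier to that document's annotations, which is convenient when the documents are later fetched one at a time.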
Note: Each of the following files is a compressed tar file. You can extract the contents of one file at a time using the command tar -zxvf followed by the file name, for example: tar -zxvf ClueWeb12_00.tgz
- ClueWeb12_00.tgz (7.2G, compressed): The annotations for ClueWeb12 Disk1/ClueWeb12_00 documents.
- ClueWeb12_01.tgz (6.5G, compressed): The annotations for ClueWeb12 Disk1/ClueWeb12_01 documents.
- ClueWeb12_02.tgz (6.6G, compressed): The annotations for ClueWeb12 Disk1/ClueWeb12_02 documents.
- ClueWeb12_03.tgz (6.3G, compressed): The annotations for ClueWeb12 Disk1/ClueWeb12_03 documents.
- ClueWeb12_04.tgz (4.9G, compressed): The annotations for ClueWeb12 Disk1/ClueWeb12_04 documents.
- ClueWeb12_05.tgz (3.0G, compressed): The annotations for ClueWeb12 Disk1/ClueWeb12_05 documents.
- ClueWeb12_06.tgz (3.2G, compressed): The annotations for ClueWeb12 Disk2/ClueWeb12_06 documents.
- ClueWeb12_07.tgz (3.9G, compressed): The annotations for ClueWeb12 Disk2/ClueWeb12_07 documents.
- ClueWeb12_08.tgz (4.8G, compressed): The annotations for ClueWeb12 Disk2/ClueWeb12_08 documents.
- ClueWeb12_09.tgz (4.8G, compressed): The annotations for ClueWeb12 Disk2/ClueWeb12_09 documents.
- ClueWeb12_10.tgz (4.6G, compressed): The annotations for ClueWeb12 Disk3/ClueWeb12_10 documents.
- ClueWeb12_11.tgz (4.4G, compressed): The annotations for ClueWeb12 Disk3/ClueWeb12_11 documents.
- ClueWeb12_12.tgz (4.1G, compressed): The annotations for ClueWeb12 Disk3/ClueWeb12_12 documents.
- ClueWeb12_13.tgz (3.7G, compressed): The annotations for ClueWeb12 Disk3/ClueWeb12_13 documents.
- ClueWeb12_14.tgz (3.6G, compressed): The annotations for ClueWeb12 Disk3/ClueWeb12_14 documents.
- ClueWeb12_15.tgz (4.4G, compressed): The annotations for ClueWeb12 Disk4/ClueWeb12_15 documents.
- ClueWeb12_16.tgz (3.6G, compressed): The annotations for ClueWeb12 Disk4/ClueWeb12_16 documents.
- ClueWeb12_17.tgz (4.0G, compressed): The annotations for ClueWeb12 Disk4/ClueWeb12_17 documents.
- ClueWeb12_18.tgz (4.4G, compressed): The annotations for ClueWeb12 Disk4/ClueWeb12_18 documents.
- ClueWeb12_19.tgz (3.6G, compressed): The annotations for ClueWeb12 Disk4/ClueWeb12_19 documents.
- checksums.md5 (4k): File "checksums.md5" contains the MD5 sums of all of the *.tgz files listed above. These MD5 sums are in the format: <md5 checksum hash> <file>
Using the annotations data with Indri Query Environment
The following code snippet is an example of how to use the Indri Query Environment to fetch a ClueWeb document using (a) an Indri index built with storeDocs=true (the ParsedDocument data structure is stored with the index), and (b) a document id. The code also shows how to find the beginning of the HTTP headers, which is used as zero (0) when calculating the annotation offsets.
Each annotation entry includes the name of the encoding that was used to calculate its byte offsets. The Generic Charset Conversion Interface (iconv) is one way to convert the document text to that encoding so that you can locate the annotation in the document using the entry's offsets.
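As a side note ahead of the Indri snippet, the conversion step could be sketched with iconv as follows. This is an illustration under assumptions, not the procedure used to build the data: convertEncoding is a hypothetical helper, and the caller has to supply both the encoding the text is currently in and the target encoding named in the annotation entry.

```cpp
#include <cerrno>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert `input` from encoding `fromCode` to encoding `toCode` using iconv.
// Throws std::runtime_error if the conversion is unsupported or the input
// contains a byte sequence that cannot be converted.
std::string convertEncoding(const std::string& input,
                            const char* fromCode,
                            const char* toCode) {
    iconv_t cd = iconv_open(toCode, fromCode);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed: unsupported conversion");
    }

    std::string output;
    char buffer[4096];
    // iconv takes a non-const pointer but does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    std::size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        std::size_t outLeft = sizeof(buffer);
        std::size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the output buffer filled up; loop and continue.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    // Flush any pending shift sequence for stateful target encodings.
    char* out = buffer;
    std::size_t outLeft = sizeof(buffer);
    iconv(cd, NULL, NULL, &out, &outLeft);
    output.append(buffer, sizeof(buffer) - outLeft);

    iconv_close(cd);
    return output;
}
```

For an entry whose encoding column reads UTF-8, the call would be convertEncoding(text, sourceEncoding, "UTF-8"), where sourceEncoding is whatever encoding the stored text is in. On some platforms iconv lives in a separate library and the program has to be linked with -liconv.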
#include <iostream>
#include <string>
#include <vector>
#include <iconv.h>
#include "indri/QueryEnvironment.hpp"
#include "indri/QueryExpander.hpp"
.
.
.
/* Hard code document id for this example. */
std::string docID = "clueweb12-0000tw-00-00000";
indri::api::QueryEnvironment *env = new indri::api::QueryEnvironment();
/* Set whether there should be one single background model or context sensitive models;
(background true for one background model; false for context sensitive models) */
env->setSingleBackgroundModel(false);
/* Add a local repository using the full pathname to the repository */
env->addIndex("/path/to/index/");
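std::vector<std::string> idList;
idList.push_back(docID);
/* Fetch all documents with a metadata key docno that matches docID. */
std::vector<lemur::api::DOCID_T> docIDs = env->documentIDsFromMetadata("docno", idList);
std::vector<indri::api::ParsedDocument*> documents = env->documents(docIDs);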
/* Check to see if we found a document */
if( documents.size() ) {
/* Just take the first document (since docids are unique, there should be only one document) */
indri::api::ParsedDocument* document = documents[0];
/* document->getContent() does not include the HTTP headers, so we take the complete document text, which includes the WARC header, the HTTP headers and the document itself. */
std::string t = document->text;
/* Convert the document text to the encoding named in the annotation entry, possibly using iconv_open()/iconv(). The converted text is assumed to end up in correctMimeTypeString. */
.
.
.
/* Find the beginning of the HTTP headers. A WARC record consists of a record header followed by a record content block and two newlines (Newlines are CRLF as per other Internet standards.). The HTTP headers and document follow.*/
size_t start = correctMimeTypeString.find("\n\n") + 2 ;
/* We now have start location (zero) of the annotation offsets. You can proceed with finding annotations using the provided offsets. */
.
.
.
} else {
std::cout << "document not found" << std::endl;
}
env->close();
delete env;
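Once the start of the HTTP headers is known, an annotation's offsets can be applied directly to the record text. The sketch below is an illustration only: httpHeaderStart and extractMention are hypothetical helpers, the WARC header is assumed to end at the first blank line (both CRLF and bare LF line endings are accepted, since the line endings in a given record may differ), and the end offset is assumed to be exclusive, as the example rows suggest.

```cpp
#include <stdexcept>
#include <string>

// Return the index of the first byte after the WARC record header, which is
// the zero (0) location for the annotation offsets. The header is assumed to
// end at the first blank line, written either as CRLF CRLF or as LF LF.
std::size_t httpHeaderStart(const std::string& record) {
    const std::size_t npos = std::string::npos;
    std::size_t crlf = record.find("\r\n\r\n");
    std::size_t lf = record.find("\n\n");
    if (crlf == npos && lf == npos) {
        throw std::runtime_error("no blank line found after the WARC header");
    }
    // Use whichever blank line comes first in the record.
    if (lf == npos || (crlf != npos && crlf < lf)) {
        return crlf + 4;
    }
    return lf + 2;
}

// Return the bytes in [begin, end) relative to the start of the HTTP headers.
// `record` must already be in the encoding named in the annotation entry.
std::string extractMention(const std::string& record,
                           std::size_t begin,
                           std::size_t end) {
    std::size_t zero = httpHeaderStart(record);
    if (begin > end || zero + end > record.size()) {
        throw std::out_of_range("annotation offsets fall outside the record");
    }
    return record.substr(zero + begin, end - begin);
}
```

Comparing the string returned by extractMention with the mention column of the annotation is a cheap sanity check: a mismatch usually means the text is in a different encoding than the one the offsets were computed for, or that the zero location was chosen differently.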
Citation
If you use this data in a publication, please cite it as:
- Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya, "FACC1: Freebase annotation of ClueWeb corpora, Version 1 (Release date 2013-06-26, Format version 1, Correction level 0)", June 2013.
Please also include in the citation the following URL(s) where the data is available (http://lemurproject.org/clueweb09/ and/or http://lemurproject.org/clueweb12/).
Other Data From Google
If you would like to learn more about data releases from Google, you may wish to consider subscribing to this low-traffic mailing list: http://goo.gl/MJb3A.
Acknowledgments
This data set was prepared by Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya (Google), and Juan Caicedo Carvajal (CMU).
Thanks to Johnny Chen, John Giannandrea, Rahul Gupta, Jesse Saba Kirchner, Ruichao Li, Eisar Lipkovitz, Jeremy O'Brien, Dave Orr, Fernando Pereira, Dave Price, and Chuck Wu for making this release possible.