ClueWeb09 Related Data:
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)

Researchers at Google annotated English-language Web pages from the ClueWeb09 and ClueWeb12 corpora. The annotation process was automatic, and hence imperfect. However, the annotations are of generally high quality, as they strove for high precision (and, by necessity, lower recall). For each entity they recognized with high confidence, they provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and two confidence levels (computed differently, see below).

You might consider using this data in conjunction with the recently released Freebase annotations of several TREC query sets.

Data Description

The annotations for each corpus are provided as a collection of 500 files (the partition into individual files is somewhat arbitrary). Each file contains annotations of multiple web pages, and each page URL is followed by a list of entities identified in that page.

Here is an excerpt from an annotation file:

clueweb09-en0000-00-04720.html

PDF	21089	21092	0.99763662	6.6723776e-05	/m/0600q
FDA	21303	21306	0.9998256	0.00057182228	/m/032mx
Food and Drug Administration	21312	21340	0.9998256	0.00057182228	/m/032mx

In this example,

"clueweb09-en0000-00-04720.html" is the name of the document that was annotated
"PDF" is the entity mention in text
21089 and 21092 are the beginning and end byte offsets of the entity mention in the input text. The zero (0) location used for calculating the annotation offsets is the beginning of the HTTP headers. This is the first byte after the WARC document header.
0.99763662 is the posterior of an entity given both the mention and the context (of the mention)
6.6723776e-05 is the posterior given just the context of the mention (ignoring the mention string itself)
/m/0600q - Freebase identifier for the entity. To look up the entity in Freebase, just prepend the string "http://www.freebase.com" before the identifier, like so: "http://www.freebase.com/m/0600q".

Some documents do not have any annotations (and are not included in the annotation files) because no Freebase entity was recognized in them with high confidence.

There are 340,451,982 documents in ClueWeb09 and 456,498,584 documents in ClueWeb12 with at least one entity annotated. On average, ClueWeb09 documents have 15 entity mentions annotated, and ClueWeb12 documents have 13 mentions annotated. Additional statistics are available in the companion files "ClueWeb09_stats.txt" and "ClueWeb12_stats.txt".

Due to the sheer size of the data, it was not possible to verify all the automatic annotations manually. Based on a small-scale human evaluation, the precision (at the currently chosen threshold) is believed to be around 80-85%. Estimating the recall is of course difficult; however, it is believed to be around 70-85%.

Processed Data

The following files are the annotations provided by Google, Inc and processed by The Lemur Project. The processed files are text files with the extension .anns.tsv. The annotations are converted to the a format that we believe is easier to process. Each .anns.tsv file contains the annotations for the documents corresponding to the warc.gz file. For instance, Disk1/ClueWeb09_English_1/en0000/00.anns.tsv contains the annotations for the documents contained in the corpus file Disk1/ClueWeb09_English_1/en0000/00.warc.gz. The file format adds two additional columns to the beginning of each line:

Document identifier (WARC-TREC-ID)
Original encoding: The name of the encoding used to process the entry. This encoding is used to calulate the beginning and end byte offsets of the entity in the corpus document.

The following is an example of the processed data format:

clueweb09-en0000-00-00005	UTF-8	G e	9188	9196	1.000000	0.000067	/m/03bnb
clueweb09-en0000-00-00005	UTF-8	COM	10850	10853	0.960598	0.000052	/m/01gj0k
clueweb09-en0000-00-00005	UTF-8	US	10856	10858	0.662877	0.005153	/m/09c7w0
clueweb09-en0000-00-00005	UTF-8	Domain Names	10871	10883	0.999885	0.000565	/m/09y1k
clueweb09-en0000-00-00011	ISO-8859-1	American	8115	8123	0.985868	0.005697	/m/09c7w0
clueweb09-en0000-00-00011	ISO-8859-1	US	14201	14203	0.985868	0.005697	/m/09c7w0
clueweb09-en0000-00-00011	ISO-8859-1	US	16075	16077	0.985868	0.005697	/m/09c7w0

Note: You can extract the content of the following files ( compressed tar file) using the command: tar -zxvf ClueWeb09_English_*

ClueWeb09_English_1.tgz (8.2G, compressed): The annotations for ClueWeb09 Disk1/ClueWeb09_English_1 documents.
ClueWeb09_English_2.tgz (8.9G, compressed): The annotations for ClueWeb09 Disk1/ClueWeb09_English_2 documents.
ClueWeb09_English_3.tgz (8.4G, compressed): The annotations for ClueWeb09 Disk1/ClueWeb09_English_3 documents.
ClueWeb09_English_4.tgz (7.9G, compressed): The annotations for ClueWeb09 Disk1/ClueWeb09_English_4 documents.
ClueWeb09_English_5.tgz (7.3G, compressed): The annotations for ClueWeb09 Disk1/ClueWeb09_English_5 documents.
ClueWeb09_English_6.tgz (7.3G, compressed): The annotations for ClueWeb09 Disk2/ClueWeb09_English_6 documents.
ClueWeb09_English_7.tgz (7.4G, compressed): The annotations for ClueWeb09 Disk2/ClueWeb09_English_7 documents.
ClueWeb09_English_8.tgz (5.4G, compressed): The annotations for ClueWeb09 Disk2/ClueWeb09_English_8 documents.
ClueWeb09_English_9.tgz (6.6G, compressed): The annotations for ClueWeb09 Disk2/ClueWeb09_English_9 documents.
ClueWeb09_English_10.tgz (4.9G, compressed): The annotations for ClueWeb09 Disk2/ClueWeb09_English_10 documents.

checksums.md5 (4k): File "checksums.md5" contains the md5 sums of all 10 of the *.tgz files. These MD5 sums are in the format: <md5 checksum hash> <file>

Using the annotations data with Indri Query Environment

The follow code snippet is an example on how to use the Indri Query Environment to grab a ClueWeb document using (a) an Indri index with storeDocs=true (ParsedDocument data structure is stored with the index), and (b) a document id. The code also shows how to find the beginning of the HTTP headers, which is used as zero (0) when calculating the annotation offsets.

The annotation entries include the mime-type used to calculate the offset to the annotation. The Generic Charset Conversion Interface is one way to convert the text so you can locate the annotation in the document using the entry's offsets.

#include <stdio.h>
#include <string>
#include <iconv.h>
#include "indri/QueryEnvironment.hpp"
#include "indri/QueryExpander.hpp"
.
.
.
/* Hard code document id for this example. */
string docID="clueweb09-en0000-00-04720";
indri::api::QueryEnvironment *env = new indri::api::QueryEnvironment();
/* Set whether there should be one single background model or context sensitive models; (background true for one background model; false for context sensitive models) */
env->setSingleBackgroundModel(false);
/* Add a local repository using the full pathname to the repository */
env->addIndex("/path/to/index/");
std::vector idList;
idList.push_back(docID.c_str());
/* Fetch all documents with a metadata key docno that matches docID. */
std::vector docIDs = env->documentIDsFromMetadata("docno", idList);
std::vector documents = env->documents(docIDs);

/* Check to see if we found a document */
if( documents.size() ) {
  /* Just take the first document (since docids are unique, there should be only one document) */
  indri::api::ParsedDocument* document = documents[0];
  /* document->getContent() will not include the HTTP headers, therefore we have to grab the complete document which includes the WARC header and content block, HTTP headers and document.*/
  string t = document->text;
  /* Convert document text to correct mime-type - possibly using iconv_open function. */
  .
  .
  .
  /* Find the beginning of the HTTP headers. A WARC record consists of a record header followed by a record content block and two newlines (Newlines are CRLF as per other Internet standards.). The HTTP headers and document follow.*/
  size_t start = correctMimeTypeString.find("\n\n") + 2 ;
  /* We now have start location (zero) of the annotation offsets. You can proceed with finding annotations using the provided offsets. */
  .
  .
  .
} else {
  cout << "document not found" << endl;
}
env->close();

Citation

If you use this data in a publication, please cite it as:

Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya, "FACC1: Freebase annotation of ClueWeb corpora, Version 1 (Release date 2013-06-26, Format version 1, Correction level 0)", June 2013.

Please also include in the citation the following URL(s) where the data is available (http://lemurproject.org/clueweb09/ and/or http://lemurproject.org/clueweb12/).

Other Data From Google

If you would like to learn more about data releases from Google, you may wish to consider subscribing to this low-traffic mailing list: http://goo.gl/MJb3A.

Acknowledgments

This data set was prepared by Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya (Google), and Juan Caicedo Carvajal (CMU).

Thanks to Johnny Chen, John Giannandrea, Rahul Gupta, Jesse Saba Kirchner, Ruichao Li, Eisar Lipkovitz, Jeremy O'Brien, Dave Orr, Fernando Pereira, Dave Price, and Chuck Wu for making this release possible.

ClueWeb09 Related Data: Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)