ClueWeb09 Related Data:
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)

Researchers at Google annotated English-language Web pages from the ClueWeb09 and ClueWeb12 corpora. The annotation process was automatic, and hence imperfect. However, the annotations are of generally high quality, as they strove for high precision (and, by necessity, lower recall). For each entity they recognized with high confidence, they provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and two confidence levels (computed differently, see below).

You might consider using this data in conjunction with the recently released Freebase annotations of several TREC query sets.

 

Data Description

The annotations for each corpus are provided as a collection of 500 files (the partition into individual files is somewhat arbitrary). Each file contains annotations of multiple web pages, and each document name is followed by a list of the entities identified in that document.

Here is an excerpt from an annotation file:

clueweb09-en0000-00-04720.html

PDF 21089 21092 0.99763662 6.6723776e-05 /m/0600q
FDA 21303 21306 0.9998256 0.00057182228 /m/032mx
Food and Drug Administration 21312 21340 0.9998256 0.00057182228 /m/032mx

In this example, three entity mentions were recognized in document clueweb09-en0000-00-04720: "PDF", "FDA", and "Food and Drug Administration" (the latter two refer to the same entity, /m/032mx). Each line gives the mention text as it appears in the page, the beginning and end byte offsets of the mention in the input text, the two confidence levels, and the entity's Freebase identifier (mid).

Some documents do not have any annotations (and are not included in the annotation files) because no Freebase entity was recognized in them with high confidence.

There are 340,451,982 documents in ClueWeb09 and 456,498,584 documents in ClueWeb12 with at least one entity annotated. On average, ClueWeb09 documents have 15 entity mentions annotated, and ClueWeb12 documents have 13 mentions annotated. Additional statistics are available in the companion files "ClueWeb09_stats.txt" and "ClueWeb12_stats.txt".

Due to the sheer size of the data, it was not possible to verify all the automatic annotations manually. Based on a small-scale human evaluation, the precision (at the currently chosen threshold) is believed to be around 80-85%. Estimating the recall is of course difficult; however, it is believed to be around 70-85%.

 

Processed Data

The following files are the annotations provided by Google, Inc. and processed by The Lemur Project. The processed files are text files with the extension .anns.tsv. The annotations are converted to a format that we believe is easier to process. Each .anns.tsv file contains the annotations for the documents in a single warc.gz file. For instance, Disk1/ClueWeb09_English_1/en0000/00.anns.tsv contains the annotations for the documents contained in the corpus file Disk1/ClueWeb09_English_1/en0000/00.warc.gz. The file format adds two additional columns to the beginning of each line: the TREC document identifier and the document's character encoding.

The following is an example of the processed data format:

clueweb09-en0000-00-00005 UTF-8 G e 9188 9196 1.000000 0.000067 /m/03bnb
clueweb09-en0000-00-00005 UTF-8 COM 10850 10853 0.960598 0.000052 /m/01gj0k
clueweb09-en0000-00-00005 UTF-8 US 10856 10858 0.662877 0.005153 /m/09c7w0
clueweb09-en0000-00-00005 UTF-8 Domain Names 10871 10883 0.999885 0.000565 /m/09y1k
clueweb09-en0000-00-00011 ISO-8859-1 American 8115 8123 0.985868 0.005697 /m/09c7w0
clueweb09-en0000-00-00011 ISO-8859-1 US 14201 14203 0.985868 0.005697 /m/09c7w0
clueweb09-en0000-00-00011 ISO-8859-1 US 16075 16077 0.985868 0.005697 /m/09c7w0
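
If you want to consume the processed files programmatically, the following is a minimal C++ sketch (not part of the distribution) that parses lines in the .anns.tsv format shown above, assuming eight tab-separated columns: document id, character encoding, mention, begin offset, end offset, the two confidence scores, and the Freebase mid. The Annotation struct and its field names are illustrative.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

/* Illustrative container for one line of a processed .anns.tsv file. */
struct Annotation {
  std::string docId;      /* TREC document identifier                  */
  std::string encoding;   /* character encoding of the source document */
  std::string mention;    /* entity mention as it appears in the text  */
  long begin;             /* beginning byte offset of the mention      */
  long end;               /* end byte offset of the mention            */
  double conf1;           /* first confidence score                    */
  double conf2;           /* second confidence score                   */
  std::string mid;        /* Freebase identifier, e.g. /m/09c7w0       */
};

int main() {
  std::vector<Annotation> annotations;
  std::string line;
  while (std::getline(std::cin, line)) {
    std::istringstream fields(line);
    Annotation a;
    std::string beginStr, endStr, conf1Str, conf2Str;
    /* Columns are tab-separated; the mention itself may contain spaces. */
    if (std::getline(fields, a.docId, '\t') &&
        std::getline(fields, a.encoding, '\t') &&
        std::getline(fields, a.mention, '\t') &&
        std::getline(fields, beginStr, '\t') &&
        std::getline(fields, endStr, '\t') &&
        std::getline(fields, conf1Str, '\t') &&
        std::getline(fields, conf2Str, '\t') &&
        std::getline(fields, a.mid, '\t')) {
      a.begin = std::stol(beginStr);
      a.end = std::stol(endStr);
      a.conf1 = std::stod(conf1Str);
      a.conf2 = std::stod(conf2Str);
      annotations.push_back(a);
    }
  }
  std::cout << "read " << annotations.size() << " annotations" << std::endl;
  return 0;
}

For example, after extracting the archives you could run it as: ./parse_annotations < 00.anns.tsv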

Note: You can extract the content of the following files (compressed tar files) using the command: tar -zxvf ClueWeb09_English_*

 

Using the annotation data with the Indri Query Environment

The following code snippet shows how to use the Indri QueryEnvironment to fetch a ClueWeb document given (a) an Indri index built with storeDocs=true (so the ParsedDocument data structure is stored with the index), and (b) a document id. The code also shows how to find the beginning of the HTTP headers, which serves as position zero (0) when applying the annotation offsets.

Each annotation entry includes the character encoding in which the byte offsets were computed. The Generic Charset Conversion Interface (iconv) is one way to convert the document text to that encoding so that you can locate the annotation in the document using the entry's offsets; a sketch of such a conversion follows the code below.


#include <stdio.h>
#include <iostream>
#include <string>
#include <vector>
#include <iconv.h>
#include "indri/QueryEnvironment.hpp"
#include "indri/QueryExpander.hpp"
.
.
.
/* Hard-code the document id for this example. */
std::string docID = "clueweb09-en0000-00-04720";
indri::api::QueryEnvironment *env = new indri::api::QueryEnvironment();
/* Set whether there should be one single background model or context-sensitive models (true for one background model; false for context-sensitive models). */
env->setSingleBackgroundModel(false);
/* Add a local repository using the full pathname to the repository. */
env->addIndex("/path/to/index/");
std::vector<std::string> idList;
idList.push_back(docID);
/* Fetch the internal ids of all documents whose metadata key docno matches docID, then retrieve those documents. */
std::vector<lemur::api::DOCID_T> docIDs = env->documentIDsFromMetadata("docno", idList);
std::vector<indri::api::ParsedDocument*> documents = env->documents(docIDs);

/* Check to see if we found a document */
if( documents.size() ) {
  /* Just take the first document (since docids are unique, there should be only one document). */
  indri::api::ParsedDocument* document = documents[0];
  /* document->getContent() will not include the HTTP headers, therefore we grab the complete document text, which includes the WARC record header as well as the content block (the HTTP headers and the document itself). */
  std::string t(document->text, document->textLength);
  /* Convert the document text in t to the character encoding recorded in the annotation entry (producing correctMimeTypeString below), possibly using the iconv functions; see the sketch after this code. */
  .
  .
  .
  /* Find the beginning of the HTTP headers. A WARC record consists of a record header followed by a blank line and then the record content block, which here holds the HTTP headers and the document (newlines are CRLF as per other Internet standards, but fall back to bare LFs just in case). */
  size_t headerEnd = correctMimeTypeString.find("\r\n\r\n");
  size_t start = (headerEnd != std::string::npos) ? headerEnd + 4
                                                  : correctMimeTypeString.find("\n\n") + 2;
  /* We now have the start location (position zero) of the annotation offsets. You can proceed with finding annotations using the provided offsets. */
  .
  .
  .
} else {
  std::cout << "document not found" << std::endl;
}
env->close();
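
The conversion step elided above can be done with the iconv functions from the Generic Charset Conversion Interface. The following is a minimal sketch of such a conversion, together with a small helper that extracts one mention given the start position computed above and an annotation's begin and end byte offsets. The function names, encoding arguments, and helper are illustrative assumptions, not part of the Indri API or of the distribution.

#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

/* Illustrative helper: convert `input` from the encoding it is currently in
   (fromEncoding) to the encoding recorded in the annotation entry
   (toEncoding), so that the annotation byte offsets line up. */
std::string convertEncoding(const std::string& input,
                            const char* fromEncoding,
                            const char* toEncoding) {
  iconv_t cd = iconv_open(toEncoding, fromEncoding);
  if (cd == (iconv_t) -1)
    throw std::runtime_error("iconv_open failed");

  std::vector<char> inBuf(input.begin(), input.end());
  std::vector<char> outBuf(input.size() * 4 + 16);  /* generous output buffer */
  char* inPtr = inBuf.empty() ? 0 : &inBuf[0];
  size_t inLeft = inBuf.size();
  char* outPtr = &outBuf[0];
  size_t outLeft = outBuf.size();

  size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
  iconv_close(cd);
  if (rc == (size_t) -1)
    throw std::runtime_error("iconv conversion failed");

  return std::string(&outBuf[0], outBuf.size() - outLeft);
}

/* Illustrative helper: given the converted document text, the start of the
   HTTP headers (position zero for the offsets), and one annotation's begin
   and end byte offsets, return the mention text. */
std::string mentionText(const std::string& correctMimeTypeString,
                        size_t start, long begin, long end) {
  return correctMimeTypeString.substr(start + begin, end - begin);
}

With the excerpt shown earlier, mentionText(correctMimeTypeString, start, 21089, 21092) should return "PDF", provided the text has been converted to the encoding in which the offsets were computed.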

 

Citation

If you use this data in a publication, please cite it as:

Please also include in the citation the following URL(s) where the data is available (http://lemurproject.org/clueweb09/ and/or http://lemurproject.org/clueweb12/).

 

Other Data From Google

If you would like to learn more about data releases from Google, you may wish to consider subscribing to this low-traffic mailing list: http://goo.gl/MJb3A.

 

Acknowledgments

This data set was prepared by Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya (Google), and Juan Caicedo Carvajal (CMU).

Thanks to Johnny Chen, John Giannandrea, Rahul Gupta, Jesse Saba Kirchner, Ruichao Li, Eisar Lipkovitz, Jeremy O'Brien, Dave Orr, Fernando Pereira, Dave Price, and Chuck Wu for making this release possible.