The ClueWeb09 Dataset

The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

 

Dataset Specifications

Web Pages:

See the Record Counts Section on the Dataset Information and Sample Files page for detailed information on the distribution of records and languages.

 

Web Graph:

The web graph for both the entire dataset and for the TREC Category B dataset (first 50 million English pages) is complete. We are in the process of retrieving the data and performing the final formatting of the web graph.

Information on how the crawl progressed is also available.

 

Dataset Distribution:

The ClueWeb09 dataset and subsets are distributed in several different ways.

 

Hard Disk Drive (HDD) Details:

Datasets are distributed on standard SATA 6 Gbit/sec 3.5" hard disk drives (HDDs) that should be compatible with most SATA interfaces, including external USB to SATA enclosures.

Disks shipped after June 21, 2023 are in Linux ext4 format. Older disks are in Linux ext3 format.

NOTE: If you will receive data on a disk of 1 TB or greater, check that your system’s operating system and hardware are compatible with large hard disks. Not all SATA external enclosures are compatible with large disks.

 

File Formats and Sample Data:

Web pages are stored in the WARC file format. Each WARC file is about 1 gigabyte uncompressed and contains several tens of thousands of web pages (e.g., 40,000). Each WARC file is compressed with gzip.
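As an illustration of working with gzip-compressed WARC files, the following is a minimal Python sketch that uses only the standard library. It is not an official reader. It assumes that each record starts with a "WARC/" version line, is followed by "Name: value" header lines and a blank line, and carries an accurate Content-Length header; real files may deviate from this, so check the Dataset Information and Sample Files page for the actual record layout. The function names and the file path in the usage note are hypothetical.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each record in an uncompressed WARC stream.

    Assumption: Content-Length gives the exact body size in bytes.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {"version": line.strip().decode("ascii", "replace")}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        body = stream.read(length)
        yield headers, body


def count_records(path: str) -> int:
    """Count the records in one gzip-compressed WARC file on disk."""
    with gzip.open(path, "rb") as stream:
        return sum(1 for _ in iter_warc_records(stream))


if __name__ == "__main__":
    # Tiny synthetic example, not real ClueWeb09 data.
    sample = (
        b"WARC/0.18\nWARC-Type: response\nContent-Length: 5\n\nhello\n\n"
        b"WARC/0.18\nWARC-Type: response\nContent-Length: 3\n\nabc\n\n"
    )
    compressed = io.BytesIO(gzip.compress(sample))
    with gzip.GzipFile(fileobj=compressed) as stream:
        for headers, body in iter_warc_records(stream):
            print(headers.get("warc-type"), len(body))
```

To count the records in a file from the dataset, call count_records with the path to one .warc.gz file (for example, a hypothetical "00.warc.gz"). Because the stream is read record by record, the full 1 gigabyte of uncompressed data is never held in memory at once.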

Please see the Dataset Information and Sample Files page for a detailed description of the contents of the dataset including the format of the dataset and sample files.

 

Online Services

The Lemur Project provides several online services to simplify use of the ClueWeb09 dataset.

Some of these services require a user name and password. If your organization has a license to use the ClueWeb09 dataset, you can obtain a username and password by contacting Jamie Callan.

 

Using A Hosted Copy of the ClueWeb09 Dataset

The ClueWeb09 dataset is available on several cloud computing services (e.g., Open Cloud, the Pittsburgh Supercomputing Center). It can also be accessed using search interfaces provided by the Lemur Project.

A ClueWeb09 dataset license is required before you can begin using a hosted copy of the dataset. The license is free.

The process for obtaining a ClueWeb09 dataset license is described below. This process takes two weeks, so please allow enough time. We do not have the power to hurry the university administrators who must approve your license.

  1. Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.

    The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.

  2. Send the complete copy (all six pages) of the signed organizational agreement to Nicole Perrotta at the Language Technologies Institute. The preferred method is to scan the signed agreement into a PDF and email it to Nicole (nperrott at andrew dot cmu dot edu). If you cannot send a PDF, you can fax the agreement to +1 412-268-6298. If you fax the agreement, please notify Nicole by email (nperrott at andrew dot cmu dot edu) so that we know to look for your fax.

  3. Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.

 

Obtaining a Copy of the ClueWeb09 Dataset

The ClueWeb09 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained from Carnegie Mellon by signing a data license agreement with Carnegie Mellon University, and paying a fee that covers the cost of distributing the dataset.

The process for obtaining a ClueWeb09 dataset is described below. This process takes two weeks, so please allow enough time. We do not have the power to hurry the university administrators who must approve your license.

  1. Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.

    The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.

  2. Send the complete copy (all six pages) of the signed organizational agreement to Nicole Perrotta at the Language Technologies Institute. The preferred method is to scan the signed agreement into a PDF and email it to Nicole (nperrott at andrew dot cmu dot edu). If you cannot send a PDF, you can fax the agreement to +1 412-268-6298. If you fax the agreement, please notify Nicole by email (nperrott at andrew dot cmu dot edu) so that we know to look for your fax.

  3. Provide order information to Nicole Perrotta by email or fax:

    1. Which version of the dataset you want: Category A, Category B, or Japanese;
    2. Which type of invoice you want: pdf (fast) or paper (slow);
    3. Mail and email address to which the invoice should be sent;
    4. Mailing address and telephone number to which the datasets should be sent (FedEx requires both the shipping address and telephone number);
    5. Preferred shipping method (1 day, 3 day, 1 week); and
    6. Method of payment (wire transfer, check, credit card).

  4. We will send you an email confirmation that we have received your order.

  5. We will send you an invoice for payment, by mail and/or email. The costs of each dataset are shown below.
     

    Item | Cost | Notes and Explanations
    ClueWeb09: the full dataset of about 1 billion pages (TREC 2009 "Category A" dataset) | $380 | Includes two 3.0 terabyte hard disks (check compatibility with your operating system and hardware)
    ClueWeb09-T09B: a subset of about 50 million English pages (TREC 2009 "Category B" dataset) | $185 | Includes one 500 gigabyte hard disk
    ClueWeb09-JA: a subset of about 67 million Japanese pages (NTCIR-9 Intent Task dataset) | $185 | Includes one 500 gigabyte hard disk
    ClueWeb09-T11Crowd (TREC-2011 Crowdsourcing dataset) | $0 | Web download only
    Shipping | (varies) | US options: 1 day, 2 day, 7 day; international options: 1 week

  6. Payment information will be included on the invoice. Payment must be made in U.S. dollars, and only by check, wire transfer, or credit card. The dataset will not be shipped until your payment is confirmed.
     
    If you are in a hurry, credit card is the fastest payment method. Please be aware that we are not automatically notified when funds arrive in CMU's bank account. After you make your payment, please notify Nicole Perrotta by email (nperrott at andrew dot cmu dot edu) so that we know to watch for it.

  7. We ship the dataset to the mailing address that you specified.

  8. Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.

 

Staying Informed

Additional information about the ClueWeb09 dataset is available on the ClueWeb09 Information Page. New information is typically posted there.

We also maintain a ClueWeb09 Mailing List, although it is used very rarely. Please note that when you browse to this page, you may receive a warning stating that the security certificate for the domain is invalid. The certificate is not invalid; it is self-signed by the list maintainers at Carnegie Mellon University. It is safe to accept the certificate.

 

Acknowledgements

The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions, or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.