The ClueWeb09 Dataset
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
Dataset Specifications
Web Pages:
- 1,040,809,705 web pages, in 10 languages
- 5 TB, compressed. (25 TB, uncompressed.)
Web Graph:
- Entire Dataset:
  - Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
  - Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)
- TREC Category B (first 50 million English pages):
  - Unique URLs: 428,136,613 (30 GB uncompressed, 10 GB compressed)
  - Total Outlinks: 454,075,638 (3 GB uncompressed, 1 GB compressed)
Information on how the crawl progressed is also available.
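As a rough sanity check on the figures above, the following sketch derives the compression ratio and the average page size from the stated totals, and compares the result with the per-file figures given under File Formats (about 1 gigabyte and about 40,000 pages per WARC file). It assumes decimal units (1 TB = 10^12 bytes), which the page does not specify, so the results are approximate.

```python
# Figures quoted in the dataset specifications.
PAGES = 1_040_809_705
COMPRESSED_TB = 5
UNCOMPRESSED_TB = 25

# Assumption: decimal units (the source does not say whether TB is decimal or binary).
BYTES_PER_TB = 10**12
BYTES_PER_GB = 10**9

# Overall gzip compression ratio implied by the two totals.
compression_ratio = UNCOMPRESSED_TB / COMPRESSED_TB

# Average uncompressed size of one page, from the dataset totals.
avg_page_bytes = UNCOMPRESSED_TB * BYTES_PER_TB / PAGES

# Cross-check against the WARC description: about 1 GB and about 40,000 pages per file.
avg_page_bytes_from_warc = 1 * BYTES_PER_GB / 40_000

print(f"compression ratio: {compression_ratio:.1f} to 1")
print(f"average page size from totals: {avg_page_bytes / 1000:.1f} KB")
print(f"average page size from WARC figures: {avg_page_bytes_from_warc / 1000:.1f} KB")
```

Both estimates come out at roughly 24 to 25 KB per page, so the totals and the per-file description are consistent with each other.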
Dataset Distribution:
The ClueWeb09 dataset and subsets are distributed in several different ways.
Full, 1 x 8 TB: The full dataset is distributed as tarred/gzipped files on one 8 terabyte (TB) hard disk.
CatB, 1 x 500 GB: The TREC "Category B" subset of the full dataset is distributed as tarred/gzipped files on one 500 gigabyte (GB) hard disk.
JA, 1 x 500 GB: The Japanese subset of the full dataset is distributed as tarred/gzipped files on one 500 gigabyte (GB) hard disk.
T11Crowd, web: The subset used by the TREC 2011 Crowdsourcing track is downloaded from the web.
Hard Disk Drive (HDD) Details:
Datasets are distributed on standard SATA 6 Gbit/sec 3.5" hard disk drives (HDDs) that should be compatible with most SATA interfaces, including external USB to SATA enclosures.
Disks shipped after June 21, 2023 are in Linux ext4 format. Older disks are in Linux ext3 format.
NOTE: If you will receive data on a disk of 1 TB or greater, check that your system’s operating system and hardware are compatible with large hard disks. Not all SATA external enclosures are compatible with large disks.
File Formats and Sample Data:
Web pages are in the WARC file format. Each WARC file is about 1 gigabyte, uncompressed. Each WARC file contains several tens of thousands of web pages (e.g., 40,000). Each WARC file is compressed by gzip.
Please see the Dataset Information and Sample Files page for a detailed description of the contents of the dataset including the format of the dataset and sample files.
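For readers who want to inspect a gzipped WARC file directly, here is a minimal sketch that uses only the Python standard library. It assumes each record starts with a `WARC/` version line, followed by `Name: value` header lines, a blank line, and a payload of exactly `Content-Length` bytes, and that page records carry `WARC-Type: response`. These are assumptions based on the general WARC layout, not a description of the exact ClueWeb09 files; consult the Dataset Information and Sample Files page for the actual record format. The function names and the file path are hypothetical.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream.

    Assumes well-formed records whose Content-Length is accurate.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        payload = stream.read(length)
        yield headers, payload


def count_pages(path):
    """Count records of type 'response' in a gzip-compressed WARC file."""
    with gzip.open(path, "rb") as f:
        return sum(
            1 for headers, _ in iter_warc_records(f)
            if headers.get("warc-type") == "response"
        )


# Example call (hypothetical file name):
# print(count_pages("example.warc.gz"))
```

Reading the payload by `Content-Length` rather than scanning for the next `WARC/` line avoids misreading page content that happens to contain that string. If a file has inaccurate lengths, this sketch will lose its place, and a dedicated WARC library would be a safer choice.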
Online Services
The Lemur Project provides several online services to simplify use of the ClueWeb09 dataset.
- Batch Query Service for ClueWeb09: Use the Indri search engine to search the ClueWeb09 Category A English or Category B dataset
- Category A - English Interactive Search: Use the Indri search engine to interactively search the English part of the ClueWeb09 Category A dataset
- Category B Interactive Search: Use the Indri search engine to interactively search the ClueWeb09 Category B dataset
- Wikipedia Interactive Search: Use the Indri search engine to interactively search the Wikipedia part of the ClueWeb09 dataset
- ClueWeb09 Attribute Lookup Service: Fast lookup of ClueWeb09 document attributes.
Using A Hosted Copy of the ClueWeb09 Dataset
The ClueWeb09 dataset is available on several cloud computing services (e.g., Open Cloud, the Pittsburgh Supercomputing Center). It can also be accessed using search interfaces provided by the Lemur Project.
A ClueWeb09 dataset license is required before you can begin using a hosted copy of the dataset. There is no cost for a dataset license; it is free.
The process for obtaining a ClueWeb09 dataset license is described below. This process takes two weeks. Please allow enough time. We do not have the power to hurry the university administrators who must approve your license.
Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.
The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.
Send the complete copy (all six pages) of the signed organizational agreement to Nicole Perrotta at the Language Technologies Institute. The preferred method is to scan the signed agreement into a PDF and email it to Nicole (nperrott at andrew dot cmu dot edu). If you cannot send a PDF, you can fax the agreement. The fax number is +1 412-268-6298. If you choose to fax the organizational agreement, please notify Nicole by email (nperrott at andrew dot cmu dot edu) so that we know to look for your fax.
Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.
Obtaining a Copy of the ClueWeb09 Dataset
The ClueWeb09 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained from Carnegie Mellon by signing a data license agreement with Carnegie Mellon University, and paying a fee that covers the cost of distributing the dataset.
The process for obtaining a ClueWeb09 dataset is described below. This process takes two weeks. Please allow enough time. We do not have the power to hurry the university administrators who must approve your license.
Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.
The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.
Send the complete copy (all six pages) of the signed organizational agreement to Nicole Perrotta at the Language Technologies Institute. The preferred method is to scan the signed agreement into a PDF and email it to Nicole (nperrott at andrew dot cmu dot edu). If you cannot send a PDF, you can fax the agreement. The fax number is +1 412-268-6298. If you choose to fax the organizational agreement, please notify Nicole by email (nperrott at andrew dot cmu dot edu) so that we know to look for your fax.
Provide order information to Nicole Perrotta by email or fax:
- Which version of the dataset you want: Category A, Category B, or Japanese;
- Which type of invoice you want: pdf (fast) or paper (slow);
- Mail and email address to which the invoice should be sent;
- Mailing address and telephone number to which the datasets should be sent (FedEx requires both the shipping address and telephone number);
- Preferred shipping method (1 day, 3 day, 1 week); and
- Method of payment (wire transfer, check, credit card).
We will send you an email confirmation that we have received your order.
We will send you an invoice for payment, by mail and/or email. The costs of each dataset are shown below.
| Item | Cost | Notes and Explanations |
|---|---|---|
| ClueWeb09: the full dataset of about 1 billion pages (TREC 2009 "Category A" dataset) | $380 | Includes two 3.0 terabyte hard disks (check compatibility with your operating system and hardware) |
| ClueWeb09-T09B: a subset of about 50 million English pages (TREC 2009 "Category B" dataset) | $185 | Includes one 500 gigabyte hard disk |
| ClueWeb09-JA: a subset of about 67 million Japanese pages (NTCIR-9 Intent Task dataset) | $185 | Includes one 500 gigabyte hard disk |
| ClueWeb09-T11Crowd (TREC-2011 Crowdsourcing dataset) | $0 | Web download only |
| Shipping | (varies) | US options: 1 day, 2 day, 7 day. International options: 1 week |

Payment information will be included on the invoice, and should be paid in U.S. dollars only. The dataset will not be shipped until your payment is confirmed. Payment can only be made via check, wire transfer, or credit card.
If you are in a hurry, credit card is the fastest payment method. Please be aware that we are not automatically notified when funds arrive in CMU's bank account. After you make your payment, please notify Nicole Perrotta by email (nperrott at andrew dot cmu dot edu) so that we know to watch for it.
We ship the dataset to the mailing address that you specified.
Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.
Staying Informed
Additional information about the ClueWeb09 Dataset is available in the ClueWeb09 Information Page. New information is typically posted there.
We also maintain a ClueWeb09 Mailing List, although it is used very rarely. Please note that when you browse to this page, you may receive a warning stating that the security certificate for the domain is invalid. The certificate is not invalid; it is self-signed by the list maintainers at Carnegie Mellon University. It is safe to accept the certificate.
Acknowledgements
The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.