The ClueWeb09 Dataset
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
Dataset Specifications
Web Pages:
- 1,040,809,705 web pages, in 10 languages
- 5 TB, compressed. (25 TB, uncompressed.)
Web Graph:
- Entire Dataset:
  - Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
  - Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)
- TREC Category B (first 50 million English pages):
  - Unique URLs: 428,136,613 (30 GB uncompressed, 10 GB compressed)
  - Total Outlinks: 454,075,638 (3 GB uncompressed, 1 GB compressed)
Information on how the crawl progressed is also available.
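As a rough sanity check on the figures above, the following sketch derives compression ratios and the average uncompressed page size. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which this page does not state, and the function names are illustrative only. Because the published sizes are rounded, the results are approximations.

```python
# Figures copied from the dataset specifications above.
PAGES = 1_040_809_705
TB = 10**12  # assumption: decimal terabytes (the page does not say which unit it uses)
GB = 10**9   # assumption: decimal gigabytes


def compression_ratio(uncompressed: int, compressed: int) -> float:
    """Return how many times smaller the compressed form is."""
    return uncompressed / compressed


def average_page_bytes(total_bytes: int, pages: int) -> float:
    """Return the mean uncompressed size of one page, in bytes."""
    return total_bytes / pages


pages_ratio = compression_ratio(25 * TB, 5 * TB)
url_ratio = compression_ratio(325 * GB, 105 * GB)
outlink_ratio = compression_ratio(71 * GB, 24 * GB)
avg_page = average_page_bytes(25 * TB, PAGES)

print(f"Web pages compress about {pages_ratio:.1f}x")
print(f"URL list compresses about {url_ratio:.1f}x")
print(f"Outlink list compresses about {outlink_ratio:.1f}x")
print(f"Average page is about {avg_page / 1000:.1f} kB uncompressed")
```

Estimates like these can help when planning disk space for the uncompressed data or for a subset of it.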
Dataset Distribution:
The ClueWeb09 dataset and subsets are distributed in several different ways.
- Full, 1 x 8 TB: The full dataset is distributed as tarred/gzipped files on one 8 terabyte (TB) hard disk.
- CatB, 1 x 500 GB: The TREC "Category B" subset of the full dataset is distributed as tarred/gzipped files on one 500 gigabyte (GB) hard disk.
- JA, 1 x 500 GB: The Japanese subset of the full dataset is distributed as tarred/gzipped files on one 500 gigabyte (GB) hard disk.
- T11Crowd, web: The subset used by the TREC 2011 Crowdsourcing track is downloaded from the web.
Hard Disk Drive (HDD) Details:
Datasets are distributed on standard SATA 6 Gbit/sec 3.5" hard disk drives (HDDs) that should be compatible with most SATA interfaces, including external USB to SATA enclosures.
Disks shipped after June 21, 2023 are in Linux ext4 format. Older disks are in Linux ext3 format.
NOTE: If you will receive data on a disk of 1 TB or larger, check that your operating system and hardware support large hard disks. Not all external SATA enclosures are compatible with large disks.
File Formats and Sample Data:
Web pages are stored in the WARC file format. Each WARC file is about 1 gigabyte uncompressed, contains several tens of thousands of web pages (e.g., 40,000), and is compressed with gzip.
Please see the Dataset Information and Sample Files page for a detailed description of the contents of the dataset including the format of the dataset and sample files.
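To illustrate how gzip-compressed WARC files like those described above can be read, here is a minimal sketch that uses only the Python standard library. It assumes each record consists of a `WARC/` version line, `Name: value` header lines, a blank line, and a body of `Content-Length` bytes. The function names, the sample URLs, and the `WARC/0.18` version string in the synthetic data are assumptions for illustration; consult the Dataset Information and Sample Files page for the actual record layout. Real files may contain irregularities that this sketch does not handle.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {"version": line.strip().decode("ascii")}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of stream) ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def make_record(uri: str, payload: bytes) -> bytes:
    """Build one synthetic record (hypothetical data, for demonstration only)."""
    head = (
        "WARC/0.18\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Build a tiny gzip-compressed sample in memory instead of reading a real file.
sample = gzip.compress(
    make_record("http://example.com/a", b"<html>A</html>")
    + make_record("http://example.com/b", b"<html>B</html>")
)

with gzip.open(io.BytesIO(sample), "rb") as f:
    records = list(iter_warc_records(f))

for headers, body in records:
    print(headers["WARC-Target-URI"], len(body))
```

To read a file from disk, pass its path to `gzip.open` in place of the in-memory buffer. Iterating record by record avoids loading a whole file of about 1 gigabyte into memory.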
Online Services
The Lemur Project provides several online services to simplify use of the ClueWeb09 dataset.
- Batch Query Service for ClueWeb09: Use the Indri search engine to search the ClueWeb09 Category A English or Category B dataset
- Category A - English Interactive Search: Use the Indri search engine to interactively search the English part of the ClueWeb09 Category A dataset
- Category B Interactive Search: Use the Indri search engine to interactively search the ClueWeb09 Category B dataset
- Wikipedia Interactive Search: Use the Indri search engine to interactively search the Wikipedia part of the ClueWeb09 dataset
- ClueWeb09 Attribute Lookup Service: Fast lookup of ClueWeb09 document attributes.
Using A Hosted Copy of the ClueWeb09 Dataset
The ClueWeb09 dataset is available on several cloud computing services (e.g., Open Cloud, the Pittsburgh Supercomputing Center). It can also be accessed using search interfaces provided by the Lemur Project.
A ClueWeb09 dataset license is required before you can begin using a hosted copy of the dataset. The license is free.
It takes about 4 weeks to obtain a dataset license. Please allow enough time. We do not have the power to hurry the university administrators who must approve your license. Once they approve it, we process your request quickly.
The steps are as follows.
Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.
The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.
Email all six pages of the completed Agreement and Order form to us (clueweb at andrew dot cmu dot edu). If you cannot send a PDF, please contact us by email to arrange an alternate method.
We will send you an email confirmation that we have received your order.
Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.
Obtaining a Copy of the ClueWeb09 Dataset
The ClueWeb09 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained from Carnegie Mellon by signing a data license agreement with Carnegie Mellon University, and paying a fee that covers the cost of distributing the dataset.
It takes 4-6 weeks to obtain a dataset. Please allow enough time. We do not have the power to hurry the university administrators who must approve your license. Once they approve it, we process your payment and ship your disks quickly.
The steps are as follows.
Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.
The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.
Email all six pages of the completed Agreement and Order form to us (clueweb at andrew dot cmu dot edu). If you cannot send a PDF, please contact us by email to arrange an alternate method.
We will send you an email confirmation that we have received your order.
Carnegie Mellon's Sponsored Projects office will send you an invoice for payment. The costs of the different versions of the dataset are shown below.
| Document Categories | Document Count | Document Formats | Distribution media | Cost* |
| --- | --- | --- | --- | --- |
| ClueWeb09 (TREC 2009 "Category A") | 1B | html | 1 ✕ 8 TB disk | $380 |
| ClueWeb09-T09B (TREC 2009 "Category B" dataset) | 50M | html | 1 ✕ 500 GB disk | $185 |
| ClueWeb09-JA (NTCIR-9 Intent Task dataset) (Japanese language documents) | 67M | html | 1 ✕ 500 GB disk | $185 |
| ClueWeb09-T11Crowd (TREC-2011 Crowdsourcing dataset) | | html | Download | $0 |

* Does not include shipping costs. Payment must be in U.S. dollars.
Note: We are not automatically notified when funds are deposited to CMU's bank account. After you make your payment, please notify us by email (clueweb at andrew dot cmu dot edu) so that we know to watch for it.
After payment is received, the dataset is shipped to you.
Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.
Staying Informed
Additional information about the ClueWeb09 Dataset is available on the ClueWeb09 Information Page. New information is typically posted there.
Acknowledgements
The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.