The ClueWeb09 Dataset:
Frequently Asked Questions
Why is the dataset named ClueWeb09?
Why is the dataset so expensive?
What is the "Category B" subset?
Will you consider modifications to the license?
What if my organization's lawyers insist on modifying the license?
How was the dataset created?
Why is the dataset named ClueWeb09? The U.S. National Science Foundation's Cluster Exploratory (CluE) program provided computational resources and funding that enabled creation of the dataset. The data was gathered from the web in 2009.
Why is the dataset so expensive? Most of the cost of each dataset covers the hard disk drive(s) used to ship the data to you; the drives are yours to keep. The remainder covers the staff time required to process dataset licenses and invoices, buy and copy disks, buy packing materials, and prepare disks for shipping, plus a small fee that helps us maintain the hardware used for duplicating disks.
What is the "Category B" subset? The TREC 2009 "Category B" dataset consists of the data in the directory "ClueWeb09_English_1" of the full dataset. This is roughly the first 50 million documents of the English corpus.
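Because the subset is simply one directory of the full dataset, it can be selected by path. The following is a minimal sketch, assuming the full dataset has been copied to a local directory of your choosing; the function name `category_b_files` and the example root path are hypothetical, and the sketch makes no assumption about file names or formats inside the directory.

```python
from pathlib import Path

# Directory name taken from the FAQ answer above.
CATEGORY_B_DIR = "ClueWeb09_English_1"


def category_b_files(dataset_root):
    """Return the sorted paths of all regular files under the Category B directory.

    `dataset_root` is the (hypothetical) local directory holding the full dataset.
    """
    subset_dir = Path(dataset_root) / CATEGORY_B_DIR
    if not subset_dir.is_dir():
        raise FileNotFoundError(f"Category B directory not found: {subset_dir}")
    # Walk the directory recursively and keep files only, skipping subdirectories.
    return sorted(p for p in subset_dir.rglob("*") if p.is_file())
```

For example, `category_b_files("/data/ClueWeb09")` would list the Category B files if the dataset were stored under that (assumed) path.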
Will you consider modifications to the license? Our license is a slight modification of the 'TREC style' licenses that other organizations have used for more than a decade to distribute web datasets, so it is well established. The cost of the dataset is kept low in part by not involving university lawyers and senior university administrators any more often than absolutely necessary. Please don't ask us to modify the license.
What if my organization's lawyers insist on modifying the license? We will consider whether your request fixes a flaw that applies to a significant group of organizations. If it does, we will try to resolve the issue fairly quickly. If it does not, we will probably refuse the request. Nearly all of the requests that we receive are minor adjustments to wording or attempts to make the license more favorable to the other organization. We reject those requests.
How was the dataset created? The ClueWeb09 dataset was created by Jamie Callan's research group at Carnegie Mellon University's Language Technologies Institute. The web crawl was done in January and February of 2009. Please see our project planning document. It is outdated and wrong in some places, because plans changed a little during the crawl, but it is still approximately correct. For information about the Sapphire web crawler, please see the Sapphire FAQ.