The ClueWeb12 Dataset

The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 733,019,372 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset. Distribution of ClueWeb12 began in January 2013.

Information about the progress of the ClueWeb12 Dataset is available here.

Exploring the ClueWeb12 Dataset

If you have login credentials for the ClueWeb09 dataset, you may use them to log into the ClueWeb12-B13 search engine until January 31, 2014. Starting February 1, 2014, you will need to have signed a ClueWeb12 data license agreement with Carnegie Mellon University to use this service. There is no charge to use the Lemur Project's ClueWeb12 online services (see below). If you need login credentials, please contact Jamie Callan.

Obtaining a Copy of the ClueWeb12 Dataset

The ClueWeb12 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained from Carnegie Mellon by signing a data license agreement with Carnegie Mellon University, and paying a fee that covers the cost of distributing the dataset.

Obtaining a dataset takes 4-6 weeks, so please allow enough time. We cannot hurry the university administrators who must approve your license. Once they approve it, we process your payment and ship your disks quickly.

The steps are as follows.

  1. Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.

    The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.

  2. Email all six pages of the completed Agreement and Order form to us (clueweb at andrew dot cmu dot edu). If you cannot send a PDF, please contact us by email to arrange an alternate method.

  3. We will send you an email confirmation that we have received your order.

  4. Carnegie Mellon's Sponsored Projects office will send you an invoice for payment. The costs of the different versions of the dataset are shown below.

    Item            Document    Document  Document  Distribution     Cost*
                    Categories  Count     Formats   media
    ClueWeb12-Full              733M      html      1 ✕ 8 TB disk    $380
    ClueWeb12-B13               50M       html      1 ✕ 500 GB disk  $185

    * Does not include shipping costs

  5. Payment must be in U.S. dollars.

    Note: We are not automatically notified when funds are deposited to CMU's bank account. After you make your payment, please notify us by email (clueweb at andrew dot cmu dot edu) so that we know to watch for it.

  6. After payment is received, the dataset is shipped to you.

  7. Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.

Acknowledgements

The creation of the ClueWeb12 dataset was sponsored by National Science Foundation grant CNS-0934358, under its Community Research Infrastructure program. We thank Google for the creation of Freebase annotations for the dataset. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.