The ClueWeb12 Dataset

The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 870,043,929 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset. Distribution of ClueWeb12 began in January 2013.

 

Information about the progress of the ClueWeb12 Dataset is available here.

 

Exploring the ClueWeb12 Dataset

Research groups that have a license to use the ClueWeb09 dataset are invited to try out the ClueWeb12-B13 dataset. An approximately 5% sample is available for interactive search with the Indri search engine. If you already have login credentials for the ClueWeb09 dataset, you may use them to log in to the ClueWeb12-B13 search engine. If you need login credentials, please contact David Pane or Jamie Callan.
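As a rough sanity check on the "approximately 5%" figure, the sketch below compares the page counts quoted on this page: the exact count for the full dataset and the rounded "about 50 million pages" given for ClueWeb12-B13. It assumes that the B13 subset is the sample referred to above; the constant and function names are illustrative only and are not part of any ClueWeb12 tooling.

```python
# Page counts quoted on this page.
FULL_PAGES = 870_043_929        # ClueWeb12-Full, exact count
B13_PAGES_APPROX = 50_000_000   # ClueWeb12-B13, "about 50 million" (rounded)


def sample_fraction(sample_pages: int, total_pages: int) -> float:
    """Return the sample size as a fraction of the full collection."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return sample_pages / total_pages


if __name__ == "__main__":
    fraction = sample_fraction(B13_PAGES_APPROX, FULL_PAGES)
    print(f"B13 is roughly {fraction:.1%} of the full dataset")
```

Because the B13 page count is rounded, the result (a little under 6%) is only indicative; it is consistent with the "approximately 5%" description.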

 

Obtaining a Copy of the ClueWeb12 Dataset

The ClueWeb12 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained from Carnegie Mellon by signing a data license agreement with Carnegie Mellon University, and paying a fee that covers the cost of distributing the dataset.

The process for obtaining a ClueWeb12 dataset is described below.

  1. Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.

    The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.

  2. Fax a complete copy (all five pages) of the signed organizational agreement to Dana Houston at the Language Technologies Institute. The fax number is +1 412-268-6298. After you have faxed the organizational agreement, please notify Dana by email (dhouston at cs dot cmu dot edu) so that we know to look for your fax. If you prefer, instead of faxing the agreement you may scan it into a PDF file and email it to Dana at the same address.

  3. Provide order information to Dana Houston by email or fax:

    1. Which version of the dataset you want: the full dataset or the "TREC 2013 Category B" subset;
    2. Which type of invoice you want: PDF (fast) or paper (slow);
    3. Mail and email address to which the invoice should be sent;
    4. Mailing address and telephone number to which the datasets should be sent (FedEx requires both the shipping address and telephone number);
    5. Preferred shipping method (1 day, 3 day, 1 week); and
    6. Method of payment (wire transfer, check, credit card).

  4. We will send you an email confirmation that we have received your order.

  5. We will send you an invoice for payment, by mail and/or email. The costs of each dataset are shown below.
     

    | Item | Cost | Notes and Explanations |
    |------|------|------------------------|
    | ClueWeb12-Full: the full dataset of about 870 million pages | $430 | Includes two 3.0 terabyte hard disks (check compatibility with your operating system and hardware) |
    | ClueWeb12-Full: the full dataset of about 870 million pages | $615 | Includes four 2.0 terabyte hard disks |
    | ClueWeb12-B13: a subset of about 50 million pages (TREC 2013 "Category B" dataset) | $180 | Includes one 500 gigabyte hard disk |
    | Shipping | (varies) | US options: 1 day, 2 day, 7 day. International options: 1 week |

  6. Payment information will be included on the invoice. Payment must be made in U.S. dollars, and only by check, wire transfer, or credit card. The dataset will not be shipped until your payment is confirmed.
     
    If you are in a hurry, credit card is the fastest payment method. Please be aware that we are not automatically notified when funds arrive in CMU's bank account. After you make your payment, please notify Dana Houston by email (dhouston at cs dot cmu dot edu) so that we know to watch for it.

  7. We ship the dataset to the mailing address that you specified.

  8. Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.

 

Acknowledgements

The creation of the ClueWeb12 dataset was sponsored by National Science Foundation grant CNS-0934358, under its Community Research Infrastructure program. We thank Google for the creation of Freebase annotations for the dataset. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.