The ClueWeb22 Dataset: Obtaining a Copy

The ClueWeb22 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained by signing a data license agreement with Carnegie Mellon University and, if necessary, paying a fee that covers the cost of distributing the dataset.

It takes 2-4 weeks to obtain a dataset. Please allow enough time. We do not have the power to hurry the university administrators who must approve your license.

The steps are as follows.

  1. Complete the Organization Agreement and Order Form. There are slightly different licenses for different types of organizations.

    The Organization Agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.

    The Organization Agreement typically applies to a single research group or unit within a larger legal entity. For example, in a university, the Organization Agreement might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.

  2. Email all six pages of the completed Organization Agreement and Order Form to us (clueweb at andrew dot cmu dot edu). If you cannot send a PDF, please contact us by email to arrange an alternate method.

  3. We will send you an email confirmation that we have received your order.

  4. The complete dataset is expensive for us to distribute and for you to store. Thus, it is distributed in several different subsets and formats to support the most common uses. Some are available by download. Others are only available on disk, which requires you to pay a distribution fee. Consider which subset(s) meet your needs.

    Document categories  Document count   Document formats               Distribution media         Cost*
    B, A, L              varies           html, jpg, vdom, txt           Dataset license only       $0
    B                    200M             txt                            Download (511 GB)          $0
    B                    200M             txt                            1 ✕ 1 TB disk              $310
    B                    200M             html, txt, vdom, in/out links  1 ✕ 18 TB disk             $715
    B                    200M             jpg                            6 ✕ 18 TB disk             $3,870
    A                    2B               html, txt, vdom, in/out links  8 ✕ 18 TB disk             $4,985
    L                    10B              txt, in/out links              2 ✕ 18 TB & 1 ✕ 8 TB disk  $1,530
    TREC-iKAT-2023       116M (passages)  txt                            Download (26 GB)           $0
    TREC-LR-2024         38K              txt                            Download (342 MB)          $0

    * Does not include shipping costs

    • If you selected a subset that can be downloaded, after your license is signed, download instructions will be sent by email.

    • If you selected a subset that is distributed on disks, after your license is signed, you must pay a distribution fee. Carnegie Mellon's Sponsored Projects office will send you an invoice for payment.

      Payment must be in U.S. dollars.

      Note: We are not automatically notified when funds are deposited to CMU's bank account. After you make your payment, please notify us by email (clueweb at andrew dot cmu dot edu) so that we know to watch for it.

      After payment is received, the dataset is shipped to you.

  5. Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.
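
When weighing the subsets listed in step 4, it can help to put the table into a small script and filter it by the formats you need and the distribution fee you can afford. The following is only a rough sketch: the `Subset` class and the `candidates` function are hypothetical names invented for this illustration, the rows are copied from the table above (the "Dataset license only" row is left out because it ships no data), and the costs exclude shipping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """One row of the distribution table (hypothetical helper type)."""
    category: str
    doc_count: str
    formats: frozenset
    media: str
    cost_usd: int  # distribution fee in U.S. dollars, excluding shipping


FULL = frozenset({"html", "txt", "vdom", "in/out links"})

# Rows copied from the table in step 4.
SUBSETS = [
    Subset("B", "200M", frozenset({"txt"}), "Download (511 GB)", 0),
    Subset("B", "200M", frozenset({"txt"}), "1 x 1 TB disk", 310),
    Subset("B", "200M", FULL, "1 x 18 TB disk", 715),
    Subset("B", "200M", frozenset({"jpg"}), "6 x 18 TB disk", 3870),
    Subset("A", "2B", FULL, "8 x 18 TB disk", 4985),
    Subset("L", "10B", frozenset({"txt", "in/out links"}),
           "2 x 18 TB & 1 x 8 TB disk", 1530),
    Subset("TREC-iKAT-2023", "116M (passages)", frozenset({"txt"}),
           "Download (26 GB)", 0),
    Subset("TREC-LR-2024", "38K", frozenset({"txt"}),
           "Download (342 MB)", 0),
]


def candidates(required_formats, max_cost_usd):
    """Return subsets that contain every required format within the budget."""
    needed = frozenset(required_formats)
    return [
        s for s in SUBSETS
        if needed <= s.formats and s.cost_usd <= max_cost_usd
    ]


# Example: which subsets include html and cost at most $1,000?
for s in candidates({"html"}, 1000):
    print(s.category, s.doc_count, s.media, f"${s.cost_usd:,}")
```

A filter like this only narrows the choice. Storage capacity on your side and shipping costs still need to be considered separately.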