The ClueWeb22 Dataset: Obtaining Data

Acquiring a ClueWeb22 dataset is a three-step process.

  1. Obtain an organizational license

  2. Obtain data: The complete dataset fills 14 ✕ 18 TB disks, which makes it expensive to distribute and store. It is therefore distributed in different subsets and formats to support the most common uses. Some subsets may be downloaded for free; others are distributed on disk and require payment. See the table below.

    After the license is complete, you will receive an email that describes how to select the subset(s) that you need and, if necessary, pay the distribution fee(s).

  3. Complete individual agreements: Each person who will use or have access to the dataset must sign an Individual Agreement. Your organization must retain the completed individual agreements of people while they have access to the dataset.

ClueWeb22 Subsets

Document Categories  Document Count   Document Formats               Distribution media         Cost*
B, A, L              varies           html, jpg, vdom, txt           Dataset license only       $0
B                    200M             txt                            Download (511 GB)          $0
B                    200M             txt                            1 ✕ 1 TB disk              $310
B                    200M             html, txt, vdom, in/out links  1 ✕ 18 TB disk             $715
B                    200M             jpg                            6 ✕ 18 TB disk             $3,870
A                    2B               html, txt, vdom, in/out links  8 ✕ 18 TB disk             $4,985
L                    10B              txt, in/out links              2 ✕ 18 TB & 1 ✕ 8 TB disk  $1,530
TREC-iKAT-2023       116M (passages)  txt                            Download (26 GB)           $0
TREC-LR-2024-T1      50               txt                            Download (0.5 MB)          $0
         * Does not include shipping costs
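
When budgeting for several disk-based subsets, it can help to total the distribution fees before shipping is added. The following is a minimal Python sketch that assumes the fees listed in the table above. The dictionary keys and the `total_fee` function are hypothetical labels invented for this illustration, not official subset names or part of any ClueWeb22 tooling.

```python
# Distribution fees copied from the subset table (US dollars, shipping excluded).
# The keys are hypothetical labels for illustration, not official subset names.
FEES_USD = {
    "B-txt-download": 0,
    "B-txt-disk": 310,
    "B-html-txt-vdom-links": 715,
    "B-jpg": 3870,
    "A-html-txt-vdom-links": 4985,
    "L-txt-links": 1530,
}


def total_fee(selection):
    """Return the summed distribution fee (USD) for the selected subsets."""
    unknown = [name for name in selection if name not in FEES_USD]
    if unknown:
        raise KeyError(f"unknown subset label(s): {unknown}")
    return sum(FEES_USD[name] for name in selection)


if __name__ == "__main__":
    # Example: category B and category A, each with html, txt, vdom, and links.
    print(total_fee(["B-html-txt-vdom-links", "A-html-txt-vdom-links"]))
```

The sketch only sums the listed fees; shipping costs are not included and must be added separately.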