The ClueWeb22 Dataset: Obtaining Data
Acquiring a ClueWeb22 dataset is a three-step process.
Obtain data: The complete dataset fills 14 ✕ 18 TB disks, which is expensive to distribute and store, so it is distributed in different subsets and formats to support the most common uses. Some subsets may be downloaded for free; others are distributed on disk and require payment. See the table below.
After the license is complete, you will receive an email that describes how to select the subset(s) that you need and, if necessary, pay the distribution fee(s).
Complete individual agreements: Each person who will use or have access to the dataset must sign an Individual Agreement. Your organization must retain each person's completed Individual Agreement for as long as that person has access to the dataset.
ClueWeb22 Subsets
| Document Categories | Document Count | Document Formats | Distribution Media | Cost* |
|---|---|---|---|---|
| B, A, L | varies | html, jpg, vdom, txt | Dataset license only | $0 |
| B | 200M | txt | Download (511 GB) | $0 |
| B | 200M | txt | 1 ✕ 1 TB disk | $310 |
| B | 200M | html, txt, vdom, in/out links | 1 ✕ 18 TB disk | $715 |
| B | 200M | jpg | 6 ✕ 18 TB disk | $3,870 |
| A | 2B | html, txt, vdom, in/out links | 8 ✕ 18 TB disk | $4,985 |
| L | 10B | txt, in/out links | 2 ✕ 18 TB & 1 ✕ 8 TB disk | $1,530 |
| TREC-iKAT-2023 | 116M (passages) | txt | Download (26 GB) | $0 |
| TREC-LR-2024-T1 | 50 | txt | Download (0.5 MB) | $0 |
| * Does not include shipping costs | | | | |
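
When an order combines several disk-based subsets, the fees in the table simply add up. The following is a minimal sketch of that arithmetic in Python. The prices are copied from the table above; the dictionary keys and the `total_cost` helper are hypothetical names invented for this illustration and are not part of any ClueWeb22 tooling. Shipping is not included.

```python
# Distribution fees in USD, copied from the ClueWeb22 Subsets table.
# Shipping costs are NOT included. The keys are hypothetical labels
# made up for this sketch; they are not official subset identifiers.
SUBSET_COST_USD = {
    "B-txt-download": 0,
    "B-txt-disk": 310,
    "B-html-txt-vdom-links": 715,
    "B-jpg": 3870,
    "A-html-txt-vdom-links": 4985,
    "L-txt-links": 1530,
}


def total_cost(selected):
    """Return the summed distribution fee for the selected subset labels."""
    return sum(SUBSET_COST_USD[label] for label in selected)


# Example: Category B with html/txt/vdom/links plus the Category B images.
order = ["B-html-txt-vdom-links", "B-jpg"]
print(total_cost(order))  # 715 + 3870 = 4585
```

An unknown label raises a `KeyError`, which is intentional here so that a typo in an order does not silently price as zero.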



