The ClueWeb22 Dataset

ClueWeb22 is the newest in the Lemur Project's ClueWeb line of datasets that support research on information retrieval, natural language processing and related human language technologies. This new dataset was developed by the Lemur Project with significant assistance and support from Microsoft Corporation.

The ClueWeb22 dataset has several novel characteristics compared with earlier ClueWeb datasets.

It is much larger.
Documents are of higher quality.
Documents are provided in several formats (HTML, clean text, screen shots).
Document outlink and inlink data is provided in a convenient format.
Document page analyses are provided that reveal where on a page text was displayed, and what was near it.
The dataset includes a large set of crowdsourced queries and shallow relevance assessments (a pseudo search log).

ClueWeb22 is available now.

Dataset Specifications

ClueWeb22 contains 10 billion web pages, organized into three subsets ("categories"), as described below. The different dataset categories are captured in varying detail and formats to control storage and computational costs. For example, ClueWeb22-B may be of interest for developing ranking algorithms, and ClueWeb22-L may be of interest for training large neural language models.

Category	Type of Pages	Size (%)	Pages in English	Pages in Other Languages	HTML	Web Graph	Clean Text	Semantic Analysis	Screen Shots
B	Most popular	200M (2%)	~43%	~57%	✔	✔	✔	✔	✔
A	Mostly head pages	2B (20%)	~40%	~60%	✔	✔	✔	✔
L	Mixed head-tail pages	10B (100%)	<40%	>60%		✔	✔
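
The percentages in the Size column follow from each category's page count relative to the full 10 billion pages. A minimal sketch of that arithmetic is shown below; the dictionary and function names are hypothetical and used only for illustration. They are not part of any ClueWeb22 tooling.

```python
# Page counts per category, taken from the table above.
# The names CATEGORY_PAGES and size_share are illustrative assumptions.
CATEGORY_PAGES = {
    "B": 200_000_000,       # most popular pages
    "A": 2_000_000_000,     # mostly head pages
    "L": 10_000_000_000,    # mixed head-tail pages (the full dataset)
}

TOTAL_PAGES = CATEGORY_PAGES["L"]


def size_share(category: str) -> float:
    """Return a category's size as a percentage of the full dataset."""
    return 100 * CATEGORY_PAGES[category] / TOTAL_PAGES


if __name__ == "__main__":
    for name in CATEGORY_PAGES:
        print(f"ClueWeb22-{name}: {size_share(name):.0f}% of all pages")
```

The page counts in the table are rounded, so the computed shares are approximate in the same way.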

Many research groups do not want the full dataset in all formats due to its size and cost. Several standard packages are provided to accommodate different research goals and budgets.

Dataset Distribution

Some ClueWeb22 packages (subsets) are small enough to download. The larger packages are distributed on disks. See the How to Get It page for details.

Publications

A. Overwijk, C. Xiong, X. Liu, C. VandenBerg, and J. Callan. ClueWeb22: 10 billion web documents with visual and semantic information. arXiv:2211.15848. 2022.
A. Overwijk, C. Xiong, and J. Callan. ClueWeb22: 10 billion web documents with rich information (SIRIP paper). In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2022.

Acknowledgements

ClueWeb22 was created through a collaboration between the Lemur Project and Microsoft Corporation. Document crawling, link extraction, semantic analysis, text extraction, and screen shots were provided by Microsoft. Dataset packaging, creation of crowdsourced queries with relevance assessments, and leaderboard activities were done by the Lemur Project.

The Team
Carnegie Mellon University	Microsoft
Jamie Callan	Chenyan Xiong
Cameron VandenBerg	Arnold Overwijk
	Xiao Lucy Liu

We thank Jimmy Lin for timely wisdom at important points in the project.

The creation of the ClueWeb22 dataset was sponsored by National Science Foundation grants CNS-1822975 and CNS-1822986. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsor.