The ClueWeb22 Dataset
ClueWeb22 is the newest in the Lemur Project's ClueWeb line of datasets that support research on information retrieval, natural language processing and related human language technologies. This new dataset was developed by the Lemur Project with significant assistance and support from Microsoft Corporation.
The ClueWeb22 dataset has several novel characteristics compared with earlier ClueWeb datasets.
- It is much larger.
- Documents are of higher quality.
- Documents are provided in several formats (HTML, clean text, screen shots).
- Document outlink and inlink data is provided in a convenient format.
- Document page analyses are provided that reveal where on a page text was displayed, and what was near it.
- The dataset includes a large set of crowdsourced queries and shallow relevance assessments (a pseudo search log).
ClueWeb22 is available now.
Dataset Specifications
ClueWeb22 contains 10 billion web pages, organized into three subsets ("categories"), as described below. The different dataset categories are captured in varying detail and formats to control storage and computational costs. For example, ClueWeb22-B may be of interest for developing ranking algorithms, and ClueWeb22-L may be of interest for training large neural language models.
Category | Type of Pages | Size (%) | Pages in English | Pages in Other Languages | HTML | Web Graph | Clean Text | Semantic Analysis | Screen Shots |
---|---|---|---|---|---|---|---|---|---|
B | Most popular | 200M (2%) | ~43% | ~57% | ✔ | ✔ | ✔ | ✔ | ✔ |
A | Mostly head pages | 2B (20%) | ~40% | ~60% | ✔ | ✔ | ✔ | ✔ | |
L | Mixed head-tail pages | 10B (100%) | <40% | >60% | ✔ | ✔ |
Many research groups do not want the full dataset in all formats due to its size and cost. Several standard packages are provided to accommodate different research goals and budgets.
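As a rough aid when weighing a category against a storage budget, the page counts in the table above can be turned into shares of the full dataset and approximate English page counts. This is a minimal sketch, not part of any ClueWeb22 tooling: the names `CATEGORY_PAGES`, `ENGLISH_FRACTION`, `share_of_total`, and `approx_english_pages` are hypothetical, and it assumes the percentages in the table are relative to category L (the full 10 billion pages). Category L is left out of the English estimate because the table gives only an upper bound (<40%).

```python
# Page counts per category, copied from the specifications table.
CATEGORY_PAGES = {
    "B": 200_000_000,
    "A": 2_000_000_000,
    "L": 10_000_000_000,
}

# Approximate fraction of English pages, copied from the table.
# Category L is omitted: the table states only "<40%".
ENGLISH_FRACTION = {
    "B": 0.43,
    "A": 0.40,
}


def share_of_total(category: str) -> float:
    """Return a category's size as a percentage of category L (assumed to be the full dataset)."""
    return 100.0 * CATEGORY_PAGES[category] / CATEGORY_PAGES["L"]


def approx_english_pages(category: str) -> int:
    """Return a rough estimate of English pages; only as precise as the '~' figures in the table."""
    return round(CATEGORY_PAGES[category] * ENGLISH_FRACTION[category])


for name in ("B", "A", "L"):
    line = f"ClueWeb22-{name}: {CATEGORY_PAGES[name]:,} pages ({share_of_total(name):.0f}% of L)"
    if name in ENGLISH_FRACTION:
        line += f", roughly {approx_english_pages(name):,} in English"
    print(line)
```

These figures count pages only; actual storage needs depend on which formats a package includes, which this sketch does not model.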
Dataset Distribution
Some ClueWeb22 packages (subsets) are small enough to download. The larger packages are distributed on disks. See the How to Get It page for details.
Publications
- A. Overwijk, C. Xiong, X. Liu, C. VandenBerg, and J. Callan. ClueWeb22: 10 billion web documents with visual and semantic information. arXiv:2211.15848. 2022.
- A. Overwijk, C. Xiong, and J. Callan. ClueWeb22: 10 billion web documents with rich information (SIRIP paper). In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2022.
Acknowledgements
ClueWeb22 was created by a collaboration of the Lemur Project and Microsoft Corporation. Document crawling, link extraction, semantic analysis, text extraction, and screen shots were provided by Microsoft. Dataset packaging, creation of crowdsourced queries with relevance assessments, and leaderboard activities were done by the Lemur Project.
The Team

Carnegie Mellon University | Microsoft |
---|---|
Jamie Callan | Chenyan Xiong |
Cameron VandenBerg | Arnold Overwijk |
Xiao Lucy Liu |
We thank Jimmy Lin for timely wisdom at important points in the project.
The creation of the ClueWeb22 dataset was sponsored by National Science Foundation grants CNS-1822975 and CNS-1822986. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsor.