The ClueWeb22 Dataset

ClueWeb22 is the newest in the Lemur Project's ClueWeb line of datasets that support research on information retrieval, natural language processing and related human language technologies. This new dataset was developed by the Lemur Project with significant assistance and support from Microsoft Corporation.

The ClueWeb22 dataset has several novel characteristics compared with earlier ClueWeb datasets.

ClueWeb22 is available now.

 

Dataset Specifications

ClueWeb22 contains 10 billion web pages, organized into three subsets ("categories"), as described below. The different dataset categories are captured in varying detail and formats to control storage and computational costs. For example, ClueWeb22-B may be of interest for developing ranking algorithms, and ClueWeb22-L may be of interest for training large neural language models.

 

Category  Type of Pages          Size (%)    Pages in English  Pages in Other Languages  HTML  Web Graph  Clean Text  Semantic Analysis  Screen Shots
B         Most popular           200M (2%)   ~43%              ~57%
A         Mostly head pages      2B (20%)    ~40%              ~60%
L         Mixed head-tail pages  10B (100%)  <40%              >60%
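
As a rough illustration of how the sizes and percentages in the table relate, the following sketch recomputes each category's share of the full 10-billion-page dataset and an approximate count of English pages. The figures are copied from the table; the dictionary layout and the function name `summarize` are assumptions made for this example and are not part of the dataset or its tools. Category L is left out of the English estimate because the table gives only an upper bound (<40%) for it.

```python
# Page counts and approximate English percentages copied from the table.
# The data layout and the helper name are illustrative assumptions.
TOTAL_PAGES = 10_000_000_000  # ClueWeb22-L, the full dataset

CATEGORIES = {
    "B": {"pages": 200_000_000, "english_percent": 43},
    "A": {"pages": 2_000_000_000, "english_percent": 40},
}


def summarize(name):
    """Return (percent of full dataset, approximate English page count)."""
    info = CATEGORIES[name]
    percent_of_total = info["pages"] * 100 / TOTAL_PAGES
    # Integer arithmetic avoids floating-point noise; the result is still
    # only an estimate because the percentages in the table are approximate.
    english_pages = info["pages"] * info["english_percent"] // 100
    return percent_of_total, english_pages


if __name__ == "__main__":
    for name in CATEGORIES:
        percent, english = summarize(name)
        print(f"ClueWeb22-{name}: {percent:.0f}% of all pages, "
              f"about {english:,} English pages")
```

Under these figures, ClueWeb22-B works out to roughly 86 million English pages and ClueWeb22-A to roughly 800 million, which may help when estimating storage or compute budgets for a given category.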

 

Many research groups do not want the full dataset in all formats due to its size and cost. Several standard packages are provided to accommodate different research goals and budgets.

 

Dataset Distribution

Some ClueWeb22 packages (subsets) are small enough to download. The larger packages are distributed on disks. See the How to Get It page for details.

 

Publications

 

Acknowledgements

ClueWeb22 was created by a collaboration of the Lemur Project and Microsoft Corporation. Document crawling, link extraction, semantic analysis, text extraction, and screen shots were provided by Microsoft. Dataset packaging, creation of crowdsourced queries with relevance assessments, and leaderboard activities were done by the Lemur Project.

 

The Team
Carnegie Mellon University  Microsoft
Jamie Callan                Chenyan Xiong
Cameron VandenBerg          Arnold Overwijk
                            Xiao Lucy Liu

 

We thank Jimmy Lin for timely wisdom at important points in the project.

 

The creation of the ClueWeb22 dataset was sponsored by National Science Foundation grants CNS-1822975 and CNS-1822986. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsor.