The ClueWeb22 Dataset:
Document Details

The ClueWeb22 documents were collected by Microsoft's Bing search engine team with support from Microsoft Research. The documents were reformatted and organized into a research dataset by the Lemur Project.

 

Documents

HTML web pages were sampled from the Bing index according to a distribution that estimates the importance of each page on the web. Specifically, a page that is more likely to satisfy the information needs of search engine users received a higher importance score and was therefore assigned a higher probability in the sampling distribution. Low-quality pages were demoted in the sampling, and spam pages were filtered out. The sampling also ensures high coverage of the most important part of the web, often referred to as the "head" of the web corpus. In total, 10 billion web pages were sampled from the indexed web pages in Bing during the first half of 2022.

Documents are provided in four formats: html, txt, vdom, and jpg. ClueWeb22-B documents are provided in all four formats. ClueWeb22-A and ClueWeb22-L documents are provided in fewer formats, to keep computational and storage costs reasonable.

 

Semantic Analysis

Web pages were analyzed by Microsoft's web page understanding system. It includes a Virtual Rendering component that enhances the HTML with annotations about how an element appears to a person, and a Semantic Annotation system that learns to classify nodes in the annotated HTML DOM tree to recognize important elements. Semantic annotations include header, table, list, title, primary content, and paragraph. Semantic annotations were then used to extract the text contents from the page.

When available, semantic annotations of web pages are stored in the vdom directory hierarchy.

The extracted text contents ("clean text") of web pages are stored in the txt directory hierarchy. The txt files may be especially useful for researchers who want to avoid parsing HTML.

 

Screen Shots

Most pages in ClueWeb22-B are also provided as jpg files ("screen shots") that show how the page would have appeared to a person viewing it on a display with a horizontal resolution of 1024 pixels and an unlimited vertical resolution.

 

Dataset Organization

Each page is stored in one or more formats (html, txt, jpg, vdom). Each format is stored in a separate directory hierarchy to make it easier to distribute different formats to different research groups. The same directory hierarchy structure is used for all page formats.

Pages were organized into segments that each contain approximately 200 million pages. Within a segment, pages were organized into eleven language-oriented streams (de, en, es, fr, it, ja, nl, po, pt, zh_chs, other). Each stream was divided into files that are a maximum size of 5 GB uncompressed (for html, about 20,000 pages per file on average, although the count varies considerably). Files in each stream were organized into subdirectories that each contain up to 100 files.

The dataset is distributed on one or more disks of varying size.

The files on a disk are organized hierarchically, as follows.

 

ClueWeb22 Document Ids

Each page in the dataset has a unique document id, composed as follows:

 
clueweb22-<subdirectory>-<file sequence>-<doc sequence>
 

Note that there is a one-to-one correspondence between a ClueWeb22 document id and its location on disk in the html, txt, vdom, and/or jpg directory trees.
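
The id layout above can be split into its three components with a short parser. This is a minimal sketch, not part of the official tooling: it assumes that no component contains a hyphen (the widths and character sets of the components are not specified here), and the names ClueWebDocId and parse_doc_id are hypothetical.

```python
import re
from typing import NamedTuple


class ClueWebDocId(NamedTuple):
    """The three components of a ClueWeb22 document id (hypothetical helper)."""
    subdirectory: str
    file_seq: str
    doc_seq: str


# Assumption: components never contain a hyphen. The documentation only
# gives the overall shape clueweb22-<subdirectory>-<file sequence>-<doc sequence>.
_DOC_ID_RE = re.compile(r"^clueweb22-([^-]+)-([^-]+)-([^-]+)$")


def parse_doc_id(doc_id: str) -> ClueWebDocId:
    """Split a document id into its components, or raise ValueError."""
    match = _DOC_ID_RE.match(doc_id)
    if match is None:
        raise ValueError(f"not a ClueWeb22 document id: {doc_id!r}")
    return ClueWebDocId(*match.groups())
```

For instance, calling parse_doc_id on the made-up id "clueweb22-xx0000-00-00000" (an illustration only, not a real document) yields the components "xx0000", "00", and "00000". Components are kept as strings so that any leading zeros survive a round trip.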

Example: The id and file path for the HTML of the third English page in the dataset are shown below.

 

WARC File Format

The HTML of web pages is grouped together in files that conform to the WARC ISO 28500 version 1.1 standard ("WARC files"). The WARC file format is described here. WARC files are compressed with gzip. When a WARC file is uncompressed, it requires about 5 GB of storage.

The WARC response header has one custom field:

  1. WARC-TREC-ID: The value is the ClueWeb22 document id defined above.

Documents are stored in UTF-8.
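
As an illustration of how the WARC-TREC-ID field might be read, the sketch below walks a gzip-compressed WARC file using only the Python standard library. It is a simplified reader under stated assumptions, not a complete WARC 1.1 implementation: it assumes every record carries a Content-Length header and that header lines are not folded across multiple lines. The function names iter_warc_records and iter_doc_ids are hypothetical.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Assumption: Content-Length is always present and gives the payload size.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def iter_doc_ids(path: str) -> Iterator[str]:
    """Yield the WARC-TREC-ID of every record in a gzip-compressed WARC file."""
    with gzip.open(path, "rb") as stream:
        for headers, _payload in iter_warc_records(stream):
            if "WARC-TREC-ID" in headers:
                yield headers["WARC-TREC-ID"]
```

Because gzip.open decompresses on the fly, the file does not need to be expanded to its full uncompressed size on disk. For production use, a dedicated WARC library is likely a safer choice than this hand-rolled reader.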

An example is coming soon.

 

Checksum Files

Files with the name "ClueWeb_*.md5" contain the MD5 checksums of the individual WARC files in the dataset. Each checksum file is in the format:


<md5 checksum hash> <file>

with one line for each file in the dataset.
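
A checksum file in this format could be used to verify a local copy roughly as follows. This is a sketch under assumptions: the file paths in the checksum file are taken to be relative to a root directory that the caller supplies, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def read_md5_manifest(manifest_path: str) -> Dict[str, str]:
    """Parse '<md5 checksum hash> <file>' lines into {file: checksum}."""
    expected: Dict[str, str] = {}
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue
            digest, filename = line.split(maxsplit=1)
            # md5sum prefixes the name with '*' in binary mode; tolerate that.
            expected[filename.lstrip("*")] = digest.lower()
    return expected


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest_path: str, root: str) -> List[str]:
    """Return the listed files that are missing or whose checksum differs."""
    bad: List[str] = []
    for filename, digest in read_md5_manifest(manifest_path).items():
        target = Path(root) / filename  # assumption: paths are relative to root
        if not target.is_file() or md5_of_file(target) != digest:
            bad.append(filename)
    return bad
```

Reading in chunks keeps memory use constant even for multi-gigabyte WARC files. An empty list from find_mismatches means every listed file is present and matches its checksum.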

An example is coming soon.

 

Record Count Files

Files with the name "ClueWeb22_*_counts.txt" contain the record counts of the individual WARC files in the dataset. Each record count file is in the format:


<file> <# of records>

with one line for each file in the dataset.
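
A record count file in this format could be loaded and totaled as sketched below. The sketch assumes the count is the last whitespace-separated token on each line; the function name read_record_counts is hypothetical.

```python
from typing import Dict


def read_record_counts(counts_path: str) -> Dict[str, int]:
    """Parse '<file> <# of records>' lines into {file: record count}."""
    counts: Dict[str, int] = {}
    with open(counts_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Split from the right so the count is always the final token.
            filename, count = line.rsplit(maxsplit=1)
            counts[filename] = int(count)
    return counts
```

Summing the values, as in sum(read_record_counts(path).values()), gives the total number of records covered by one count file, which can be compared against the number of records actually read from the corresponding WARC files.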

An example is coming soon.

 

Record Counts

Segment         # Records
ClueWeb22_00    TBD
:               :
Total           TBD

Record counts are coming soon

 

Summary Statistics

          Size (T)
Subset    HTML
B         6.9

More summary statistics are coming soon