The ClueWeb22 Dataset:
Document Details

The ClueWeb22 documents were collected and processed by Microsoft's Bing search engine team with support from Microsoft Research. The documents were reformatted and organized into a research dataset by the Lemur Project.

 

Documents

HTML web pages were sampled from the Bing index according to the distribution used to estimate the importance of each page on the web. Specifically, a page that is more likely to satisfy the potential information needs of search engine users received a higher importance score and was therefore assigned a higher probability in the sampling distribution. Low-quality pages were demoted in the sampling, and spam pages were filtered out. The sampling also ensured high coverage of the most important part of the web, often referred to as the "head" of the web corpus. In total, 10 billion web pages were sampled from the pages indexed by Bing during the first half of 2022.

Documents are provided in six formats: html, outlink, inlink, txt, vdom, and jpg. ClueWeb22-B documents are provided in all six formats. ClueWeb22-A and ClueWeb22-L documents are provided in fewer formats, to keep computational and storage costs reasonable.

 

Semantic Analysis

Web pages were analyzed by Microsoft's web page understanding system. It includes a Visual Rendering component that enhances the HTML with annotations about how an element appears to a person, and a Semantic Annotation system that learns to classify nodes in the annotated HTML DOM tree to recognize important elements. Semantic annotations include header, table, list, title, primary content, and paragraph. Semantic annotations were then used to extract the text contents from the page.

When available, semantic annotations of web pages are stored in the vdom directory hierarchy.

The extracted text contents ("clean text") of web pages are stored in the txt directory hierarchy. The txt files may be especially useful for researchers who want to avoid parsing HTML.

 

Screen Shots

Most pages in ClueWeb22-B are also provided as jpg files ("screen shots") that show how the page would have appeared to a person viewing it on a display with a horizontal resolution of 1024 pixels and an unlimited vertical resolution.

 

Outlinks and Inlinks

Many pages have outlink and inlink data. Outlink data are the URL and text associated with each <a href="..."> HTML tag in the page, if any. Inlink data are the URLs of other pages that point to the page, together with the text enclosed between the <a> and </a> HTML tags.

For example, suppose http://lemurproject.org/clueweb22/index.html contains the following text.

<a href="http://lemurproject.org/clueweb22/docspecs.html">ClueWeb22 dataset format</a>

http://lemurproject.org/clueweb22/index.html has an outlink to http://lemurproject.org/clueweb22/docspecs.html with the text "ClueWeb22 dataset format".

http://lemurproject.org/clueweb22/docspecs.html has an inlink from http://lemurproject.org/clueweb22/index.html with the text "ClueWeb22 dataset format".
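The relationship between outlinks and inlinks can be illustrated with a small sketch. It is not the extraction pipeline used to build the dataset; the names OutlinkParser, extract_outlinks, and invert_to_inlinks are hypothetical, and it relies only on Python's standard html.parser module.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collects (url, anchor text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []   # finished (url, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.outlinks.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_outlinks(page_url, html):
    """Return the outlinks of one page as a list of (url, text) pairs."""
    parser = OutlinkParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks


def invert_to_inlinks(outlinks_by_page):
    """Turn {source: [(target, text)]} into {target: [(source, text)]}."""
    inlinks = {}
    for source, links in outlinks_by_page.items():
        for target, text in links:
            inlinks.setdefault(target, []).append((source, text))
    return inlinks
```

Applied to the index.html example above, extract_outlinks yields one outlink to docspecs.html, and inverting it yields one inlink for docspecs.html with the same text.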

 

Dataset Organization

Each page is stored in one or more formats (html, outlink, inlink, txt, jpg, vdom). Each format is stored in a separate directory hierarchy to make it easier to distribute different formats to different research groups. The same directory hierarchy structure is used for all page formats.

Pages are organized into segments that each contain approximately 200 million pages. Within a segment, pages are organized into eleven language-oriented streams (de, en, es, fr, it, ja, nl, po, pt, zh_chs, other). Each stream is divided into files that are a maximum size of 5 GB uncompressed (for html, about 20,000 pages per file on average, although the count varies considerably). Files in each stream are organized into subdirectories that each contain up to 100 files.

The dataset is distributed on one or more disks of varying size.

The files on a disk are organized hierarchically, as follows.

ClueWeb22/

 

ClueWeb22 Document Ids

Each page in the dataset has a unique document id, composed as follows:

 
clueweb22-<subdirectory>-<file sequence>-<doc sequence>
 

Sequence counts start at 0. For example, the first document in a file is the 0'th document.

There is a one-to-one correspondence between a ClueWeb22 document id and its location on disk in the html, outlink, inlink, txt, vdom, and/or jpg directory trees.
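A document id can be split into its components with a few lines of Python. This is a minimal sketch: the regular expression assumes that a subdirectory name is a language id followed by digits (as in "en0003"), and the helper names parse_doc_id and file_stem are hypothetical. The file stem "en0003-18" matches the file names used elsewhere in this document (e.g., en0003-18.warc.offsets).

```python
import re
from typing import NamedTuple

# Assumed shape: clueweb22-<subdirectory>-<file sequence>-<doc sequence>
DOC_ID_RE = re.compile(r"^clueweb22-([a-z_]+\d+)-(\d+)-(\d+)$")


class DocId(NamedTuple):
    subdirectory: str   # e.g. "en0003"
    file_seq: str       # kept as a string to preserve its formatting
    doc_seq: int        # 0-based position of the document in its file


def parse_doc_id(doc_id):
    """Split a ClueWeb22 document id into its three components."""
    match = DOC_ID_RE.match(doc_id)
    if match is None:
        raise ValueError(f"not a ClueWeb22 document id: {doc_id!r}")
    return DocId(match.group(1), match.group(2), int(match.group(3)))


def file_stem(doc_id):
    """Return the '<subdirectory>-<file sequence>' part shared by a file's documents."""
    parsed = parse_doc_id(doc_id)
    return f"{parsed.subdirectory}-{parsed.file_seq}"
```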

Example:

 

Document Id --> Dataset Subset

ClueWeb22 contains 10 billion web pages organized into three subsets ("categories"). Smaller subsets are contained in larger subsets. ClueWeb22-B ⊂ ClueWeb22-A ⊂ ClueWeb22-L. The subsets that contain a document can be determined from its document id.

lang_id is a language id such as "de" or "en".

Thus, clueweb22-en0003-18-00000 is in ClueWeb22-B, because it contains "en00". It is also in ClueWeb22-A and ClueWeb22-L, because larger subsets contain smaller subsets.

 

WARC (.html) File Format

HTML Web pages are stored in files that conform to the WARC ISO 28500 version 1.1 standard ("WARC files").

ClueWeb22 WARC files are stored using record-at-a-time compression (WARC 1.1 Annex D 12.2). Each WARC record is compressed independently using gzip compression. The WARC file is a concatenation of gzipped records. The entire WARC file can be uncompressed by gunzip. Individual WARC records can also be uncompressed individually (discussed below).

Each compressed WARC file requires up to about 1 GB of storage. An uncompressed file requires up to about 5 GB of storage.
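Record-at-a-time compression can be illustrated with a short sketch that uses only Python's standard gzip and zlib modules. It builds a tiny in-memory stand-in for a ClueWeb22 file rather than reading a real one, and the helper name iter_gzip_members is hypothetical.

```python
import gzip
import zlib


def iter_gzip_members(data):
    """Yield the uncompressed content of each gzip member in `data`, in order."""
    while data:
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        yield member.decompress(data) + member.flush()
        # Bytes after the end of this member belong to the following members.
        data = member.unused_data


# Two records, each compressed independently and then concatenated.
records = [b"first record\r\n", b"second record\r\n"]
archive = b"".join(gzip.compress(record) for record in records)

# The whole concatenation uncompresses like an ordinary gzip file ...
whole = gzip.decompress(archive)

# ... and the individual records can also be recovered one at a time.
separate = list(iter_gzip_members(archive))
```

Here whole is the two records joined together, and separate is the list of the two original records.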

The first WARC record is a 'warcinfo' record that describes the records that follow it. Its format is "header CRLF block CRLF CRLF". The header consists of a WARC version and seven warc-fields, as shown below. The WARC version and warc-fields are terminated by CRLF (CR is shown as ^M in the example below).

WARC/1.1^M
WARC-Type: warcinfo^M
WARC-Record-ID: <urn:uuid:8a19a16d-03a2-4a38-977a-f96bbcff7b55>^M
WARC-Filename: en0000-49.warc.gz^M
WARC-Date: 2022-06-29T02:48:18.503403Z^M
WARC-Block-Digest: sha1:37KCPH6UPZKBG73A63XAR4LGW3OFHHGO^M
Content-Type: application/warc-fields^M
Content-Length: 148^M

The block contains four lines, as shown below. Each line is terminated by CRLF (CR is shown as ^M in the example below).

software: warcio^M
format: WARC File Format 1.1^M
isPartOf: ClueWeb22^M
description: The Lemur Project's ClueWeb22 dataset (http://lemurproject.org/)^M

The remaining WARC records are 1-99,999 response records. The response record format is "header CRLF block CRLF CRLF". A header consists of a WARC version, the seven warc-fields shown above, and the nine custom warc-fields shown below. The WARC version and warc-fields are terminated by CRLF.

  1. ClueWeb22-ID: The value is a ClueWeb22 document id, as defined above.

  2. URL-Hash: A hash of the document's URL that may be convenient for use with efficient data structures.

  3. Language: A language id code (e.g., de, en, es).

  4. VDOM-Heading: A space-separated list of DOM tree nodes that are classified as containing headings.

  5. VDOM-List: A space-separated list of DOM tree nodes that are classified as containing lists.

  6. VDOM-Paragraph: A space-separated list of DOM tree nodes that are classified as containing passages.

  7. VDOM-Primary: A space-separated list of DOM tree nodes that are classified as containing the primary document content.

  8. VDOM-Table: A space-separated list of DOM tree nodes that are classified as containing tables.

  9. VDOM-Title: A space-separated list of DOM tree nodes that are classified as containing titles.

The values for the VDOM-* warc-fields correspond to HTML tags with matching attribute values. It is not necessary to produce a DOM tree to use the VDOM-* warc-fields.

The response record content block is the HTML of a ClueWeb22 document.
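The "header CRLF block CRLF CRLF" layout can be parsed by hand, as sketched below for a single uncompressed record. This is only an illustration of the layout under the assumptions stated in the comments; the function name parse_warc_record is hypothetical, and a maintained WARC library is likely a better choice for real work.

```python
def parse_warc_record(raw):
    """Split one uncompressed WARC record into (version, fields, block).

    Assumes `raw` holds exactly one record laid out as
    "header CRLF block CRLF CRLF", where every header line ends with CRLF.
    """
    # The header ends at the first empty line (two consecutive CRLFs).
    header, separator, rest = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line after the WARC header")

    lines = header.decode("utf-8").split("\r\n")
    version = lines[0]                      # e.g. "WARC/1.1"

    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()

    # Content-Length gives the size of the block in bytes.
    length = int(fields["Content-Length"])
    block = rest[:length]
    return version, fields, block
```

For a response record, fields would also contain the custom warc-fields listed above (for example fields["ClueWeb22-ID"]), and block would be the HTML of the document.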

Example files:

 

Clean Text (.txt) File Format

Text extracted from HTML Web pages is stored in a simple JSON file format.

The JSON file is a concatenation of 1-99,999 JSON records that each represent one document. Each JSON record (document) is contained on a single line and enclosed within curly braces ('{' and '}'). Each record contains the five fields shown below.

Clean text (.txt) files are compressed with gzip.
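Because each record is one JSON object on one line, a compressed file can be read sequentially with Python's standard gzip and json modules. The sketch below assumes that each line ends with a newline character; the function name iter_json_records is hypothetical, and no particular field names are assumed.

```python
import gzip
import json


def iter_json_records(source):
    """Yield one dict per JSON record in a gzip-compressed file.

    `source` may be a file name or a binary file object. Python's gzip
    module reads a concatenation of gzipped records as a single stream.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:                 # skip empty lines, if any
                yield json.loads(line)
```

The same function should work for the outlink and inlink files described below, because they use the same one-record-per-line JSON layout.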

Example files:

 

Outlink (.outlink) File Format

Text extracted from outlinks is stored in a simple JSON file format.

The JSON file is a concatenation of 1-99,999 JSON records that each represent one document. Each JSON record (document) is contained on a single line and enclosed within curly braces ('{' and '}'). Each record contains the five fields shown below.

Outlink (.outlink) files are compressed with gzip.

Example files:

 

Inlink (.inlink) File Format

Text extracted from inlinks is stored in a simple JSON file format.

The JSON file is a concatenation of 1-99,999 JSON records that each represent one document. Each JSON record (document) is contained on a single line and enclosed within curly braces ('{' and '}'). Each record contains the five fields shown below.

Inlink (.inlink) files are compressed with gzip.

Example files:

 

Visual Document Object Model (.vdom) File Format

Visual Document Object Models (.vdom) are stored in Protocol Buffers ("protobufs") that are stored in ZIP archives. A ZIP archive contains 1-99,999 protobufs. A protobuf stores the annotated document object model (DOM) of a single document. A document's protobuf filename is its ClueWeb22 document id with a ".bin" filetype. For example: clueweb22-en0003-18-00002.bin.

Protobufs are accessed using the open source ClueWeb22Api, a small software package written in Python. It is available from the Lemur Project's GitHub repository.

Example files:

 

Screenshot (.jpg) File Format

Rendered versions of each page ("screenshots") are stored as JPEG images in ZIP archives. A ZIP archive contains 1-99,999 JPEG files that each represent one document rendered by a browser with 1024 horizontal pixels and an infinite number of vertical pixels. A document's filename is its ClueWeb22 document id with a ".jpg" filetype. For example: clueweb22-en0003-18-00002.jpg.

Most pages have screenshots, but some do not.

Example files:

 

Compression

Most of the ClueWeb22 file formats are compressed by gzip (.gz). A few file formats are compressed by zip (.zip). Both forms of compression are described below.

 

Gzip (.gz)

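Files with .gz extensions are stored using record-at-a-time gzip compression. Each document is compressed independently. The .gz file is a concatenation of gzipped records, which supports two types of uncompression: the entire file can be uncompressed at once (e.g., by gunzip), or an individual record can be extracted and uncompressed on its own.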

Each .gz file has an accompanying .offsets file (e.g., en0003-18.warc.offsets) that supports random access to documents for reranking or other tasks. Each line in an offsets file contains a byte offset from the start of the compressed file to the start of a compressed document. The n'th line contains the byte offset of the n'th compressed document. The last byte offset is a position one byte beyond the end of the file. Its only purpose is to enable software to find the length of the last compressed document.

Each line contains a 10-digit byte offset terminated by a newline character (11 characters) to support direct access to the line for a specific document.

For example, the first six lines of en0003-18.warc.offsets are shown below.

0000000255
0000009393
0000037586
0000056405
0000060308
0000073201

"0000000255" indicates that the compressed WARC response record for document en0003-18-00000 (the 0'th document) begins 255 bytes from the beginning of the compressed WARC file.

"0000073201" indicates that the compressed WARC response record for document en0003-18-00005 (the 5'th document, because counts start at 0) begins 73201 bytes from the beginning of the compressed WARC file.

All .gz files compressed with gzip have a corresponding .offsets file to support random access and extraction of documents. After the compressed bytes are extracted, they may be uncompressed using your favorite gzip-compatible compression library (e.g., Python's Lib/gzip.py).
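Random access with an offsets file can be sketched as follows. The function names read_offset and read_document are hypothetical, and both files are assumed to be opened in binary mode. Because every line of the offsets file is exactly 11 bytes long, the line for document n starts at byte 11 * n.

```python
import gzip

OFFSET_LINE_BYTES = 11   # 10 digits plus a newline character


def read_offset(offsets_file, n):
    """Return the n'th byte offset (0-based) from an open .offsets file."""
    offsets_file.seek(n * OFFSET_LINE_BYTES)
    line = offsets_file.read(OFFSET_LINE_BYTES)
    return int(line.decode("ascii").strip())


def read_document(gz_file, offsets_file, n):
    """Return the uncompressed bytes of the n'th document (0-based)."""
    start = read_offset(offsets_file, n)
    # The next line marks where this document ends. For the last document
    # it is the final offset, one byte beyond the end of the .gz file.
    end = read_offset(offsets_file, n + 1)
    gz_file.seek(start)
    return gzip.decompress(gz_file.read(end - start))
```

For a WARC file, the returned bytes are one complete response record (header and HTML block). For the JSON-based formats, they are one JSON record.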

Example files:

 

Zip (.zip)

Files with .zip extensions are ZIP archives. A ZIP archive contains one or more files that may be compressed, and a directory that identifies the location of each file in the archive. [1] In ClueWeb22 ZIP archives, a file is a record that describes a ClueWeb22 document. Its filename is the ClueWeb22 document id and its filetype indicates the file format. For example: clueweb22-en0003-18-00002.jpg (screenshot) or clueweb22-en0003-18-00002.bin (vdom).

Some ZIP software libraries support the ability to extract just a single file from the archive (without uncompressing the entire archive), thereby supporting random access to documents for reranking or other tasks.
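Python's standard zipfile module is one such library. The sketch below reads a single document from an archive without uncompressing the other members. The function name read_member is hypothetical, and it assumes that members are stored under the bare "<document id>.<filetype>" name with no directory prefix inside the archive.

```python
import zipfile


def read_member(zip_source, doc_id, filetype):
    """Return the bytes of one document from a ClueWeb22 ZIP archive.

    `zip_source` may be a file name or a seekable binary file object.
    `filetype` is "jpg" for screenshots or "bin" for vdom protobufs.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # The archive directory locates the member, so only that member
        # is read and uncompressed.
        return archive.read(f"{doc_id}.{filetype}")
```

A vdom member read this way is a serialized protobuf that still needs to be decoded (for example with ClueWeb22Api); a jpg member is an ordinary JPEG image.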

 

Checksum Files

Every data file has a corresponding checksum file that contains the MD5 sum of the data file. Each checksum file contains a single 32-character string that is an MD5 checksum.
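A data file can be verified against its checksum with Python's standard hashlib module, as in the sketch below. The function names md5_of_file and checksum_matches are hypothetical. The data file is read in chunks so that a multi-gigabyte file does not have to fit in memory.

```python
import hashlib


def md5_of_file(data_file, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of an open binary file."""
    digest = hashlib.md5()
    # Read fixed-size chunks until read() returns an empty bytes object.
    for chunk in iter(lambda: data_file.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(data_file, checksum_text):
    """Compare a data file with the contents of its checksum file."""
    expected = checksum_text.strip().lower()
    return md5_of_file(data_file) == expected
```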

Example files:

 

Record Count Files

The dataset includes files that indicate the number of document records in each {html, outlink, inlink, txt, vdom, jpg} file. Record count files are in comma separated values (.csv) format, and organized by stream id. They have two fields.

In the ClueWeb22-B subset, the html, txt, and vdom formats all have the same record counts, because each document has html, txt, and vdom formats. The outlink, inlink, and jpg record counts may differ because some documents do not have outlinks, inlinks, and/or screen shots. This is also true for the ClueWeb22-A subset.

Example file:

 

Summary Statistics

                          Size (T)
Subset  Documents         HTML   outlink  inlink  txt    vdom   jpg
B       200,000,000       6.8    0.5      0.4     0.4    3.5    80.5
A       2,000,000,000     n/a
L       10,000,000,000    n/a    8.8      2.3     20.7   n/a    n/a

 

Dataset Versions (Change Log)

As errors and problems are discovered, the dataset is corrected or improved, which produces slightly different versions of the dataset. The dataset version is indicated by an empty file with the name "version_<subset>_<version_id>" in the root directory. The subset is either B, A, or L. For example, version_L_01.00.
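The version marker can be read from its file name alone, since the file itself is empty. The sketch below parses the "version_<subset>_<version_id>" pattern; the function name parse_version_marker is hypothetical, and the version id is returned as an uninterpreted string because its internal format is not specified here.

```python
import re

# version_<subset>_<version_id>, where the subset is B, A, or L.
VERSION_RE = re.compile(r"^version_(B|A|L)_(.+)$")


def parse_version_marker(filename):
    """Return (subset, version_id) for a version marker file name."""
    match = VERSION_RE.match(filename)
    if match is None:
        raise ValueError(f"not a version marker: {filename!r}")
    return match.group(1), match.group(2)
```

For the example above, parse_version_marker("version_L_01.00") returns ("L", "01.00").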