The ClueWeb22 Dataset:
Document Details
The ClueWeb22 documents were collected and processed by Microsoft's Bing search engine team with support from Microsoft Research. The documents were reformatted and organized into a research dataset by the Lemur Project.
Documents
HTML web pages were sampled from the Bing index based on the distribution used to estimate the importance of each page on the web. Specifically, a page that is more likely to satisfy the potential information needs of search engine users received a higher importance score and was therefore assigned a higher probability in the sampling distribution. Low-quality pages were demoted in the sampling, and spam pages were filtered out. The sampling also ensured high coverage of the most important part of the web, often referred to as the "head" of the web corpus. In total, 10 billion web pages were sampled from the pages indexed by Bing during the first half of 2022.
Documents are provided in six formats: html, outlink, inlink, txt, vdom, and jpg. ClueWeb22-B documents are provided in all six formats. ClueWeb22-A and ClueWeb22-L documents are provided in fewer formats, to keep computational and storage costs reasonable.
Semantic Analysis
Web pages were analyzed by Microsoft's web page understanding system. It includes a Visual Rendering component that enhances the HTML with annotations about how an element appears to a person, and a Semantic Annotation system that learns to classify nodes in the annotated HTML DOM tree to recognize important elements. Semantic annotations include header, table, list, title, primary content, and paragraph. Semantic annotations were then used to extract the text contents from the page.
When available, semantic annotations of web pages are stored in the vdom directory hierarchy.
The extracted text contents ("clean text") of web pages are stored in the txt directory hierarchy. The txt files may be especially useful for researchers that want to avoid parsing HTML.
Screen Shots
Most pages in ClueWeb22-B are also provided as jpg files ("screen shots") that show how the page would have appeared to a person viewing it on a display with a horizontal resolution of 1024 pixels and an unlimited vertical resolution.
Outlinks and Inlinks
Many pages have outlink and inlink data. Outlink data are the url and text associated with the <a href="..."> HTML tags in the page, if any. Inlink data are the urls of other pages that point to the page, with the text enclosed in the <a> and </a> HTML tags.
For example, suppose http://lemurproject.org/clueweb22/index.html contains the following text.
<a href="http://lemurproject.org/clueweb22/docspecs.html">ClueWeb22 dataset format</a>
http://lemurproject.org/clueweb22/index.html has an outlink to http://lemurproject.org/clueweb22/docspecs.html with the text "ClueWeb22 dataset format".
http://lemurproject.org/clueweb22/docspecs.html has an inlink from http://lemurproject.org/clueweb22/index.html with the text "ClueWeb22 dataset format".
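The relationship between outlinks and inlinks in the example above can be sketched in Python. This is only an illustration of the definitions, not the pipeline that produced the dataset; the names AnchorCollector, outlinks, and inlinks are hypothetical helpers introduced here.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def outlinks(html):
    """Return the outlinks of one page as a list of (url, anchor text)."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def inlinks(pages):
    """Invert outlinks: pages maps url -> HTML; result maps target url -> [(source url, text)]."""
    result = defaultdict(list)
    for source, html in pages.items():
        for target, text in outlinks(html):
            result[target].append((source, text))
    return dict(result)
```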
Dataset Organization
Each page is stored in one or more formats (html, outlink, inlink, txt, jpg, vdom). Each format is stored in a separate directory hierarchy to make it easier to distribute different formats to different research groups. The same directory hierarchy structure is used for all page formats.
Pages are organized into segments that each contain approximately 200 million pages. Within a segment, pages are organized into eleven language-oriented streams (de, en, es, fr, it, ja, nl, po, pt, zh_chs, other). Each stream is divided into files that are a maximum size of 5 GB uncompressed (for html, about 20,000 pages per file on average, although the count varies considerably). Files in each stream are organized into subdirectories that each contain up to 100 files.
The dataset is distributed on one or more disks of varying size.
The files on a disk are organized hierarchically, as follows.
ClueWeb22/
- format/ One of { html, outlink, inlink, txt, jpg, vdom }
  - language id/ One of { de, en, es, fr, it, ja, nl, po, pt, zh_chs, other }
    - stream id/ Example stream id: en00. A stream contains up to 80 subdirectories.
      - subdirectory/ Example subdirectory: en0003. A subdirectory contains up to 100 files.
        - file: Example file: en0003-18.warc.gz
          Example filepath: ClueWeb22/html/en/en00/en0003/en0003-18.warc.gz
- README.txt: Each disk contains a README.txt file.
- version_<subset>_<n>: Each disk contains an empty file that identifies the subset (B, A, or L) and version of the dataset.
- checksums: Each disk contains a directory of checksums.
- record_counts: Record count files show the number of documents in each file for all six document formats.
  - format/ One of { html, outlink, inlink, txt, jpg, vdom }.
    Example filepath: ClueWeb22/record_counts/html/en00_counts.csv.
ClueWeb22 Document Ids
Each page in the dataset has a unique document id, composed as follows:
clueweb22-<subdirectory>-<file sequence>-<doc sequence>
- subdirectory: As defined above. For example, "en0003", "es0102", and "zh_chs0304".
- file sequence: A two digit value from 0 to 99 that uniquely identifies the file in the subdirectory.
- doc sequence: A five digit value from 0 to 99,999 that uniquely identifies the document in the file.
Sequence counts start at 0. For example, the first document in a file is the 0'th document.
There is a one-to-one correspondence between a ClueWeb22 document id and its location on disk in the html, outlink, inlink, txt, vdom, and/or jpg directory trees.
Example:
- The third document in the dataset has id clueweb22-en0000-00-00002.
- Location of its HTML: The third document in ClueWeb22/html/en/en00/en0000/en0000-00.warc.gz
- Location of its clean text: The third document in ClueWeb22/txt/en/en00/en0000/en0000-00.json.gz
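The mapping from a document id to its location on disk can be sketched as follows. This is a minimal sketch, assuming that every subdirectory name is a language id followed by four digits (as in "en0003", "es0102", and "zh_chs0304") and that the stream id is the language id plus the first two of those digits. The function name locate and the SUFFIX table are hypothetical; the filename suffixes are taken from the example paths in this document.

```python
import re

# Filename suffix for each format, as seen in this document's example paths.
SUFFIX = {
    "html": ".warc.gz",
    "txt": ".json.gz",
    "outlink": ".json.gz",
    "inlink": ".json.gz",
    "vdom": ".zip",
    "jpg": ".zip",
}

# clueweb22-<subdirectory>-<file sequence>-<doc sequence>
# Assumption: <subdirectory> = <language id><2-digit stream><2-digit subdirectory>.
DOC_ID = re.compile(r"^clueweb22-([a-z_]+)(\d{2})(\d{2})-(\d{2})-(\d{5})$")


def locate(doc_id, fmt, root="ClueWeb22"):
    """Return (path of the file holding the document, 0-based position in that file)."""
    match = DOC_ID.match(doc_id)
    if match is None:
        raise ValueError(f"not a ClueWeb22 document id: {doc_id!r}")
    lang, stream_no, subdir_no, file_seq, doc_seq = match.groups()
    stream = lang + stream_no        # e.g. en00
    subdir = stream + subdir_no      # e.g. en0000
    path = f"{root}/{fmt}/{lang}/{stream}/{subdir}/{subdir}-{file_seq}{SUFFIX[fmt]}"
    return path, int(doc_seq)
```

For example, locate("clueweb22-en0000-00-00002", "html") yields the WARC path from the example above together with position 2, the third document when counting from 0.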
Document Id --> Dataset Subset
ClueWeb22 contains 10 billion web pages organized into three subsets ("categories"). Smaller subsets are contained in larger subsets. ClueWeb22-B ⊂ ClueWeb22-A ⊂ ClueWeb22-L. The subsets that contain a document can be determined from its document id.
- ClueWeb22-B: Document ids of the form clueweb22-<lang_id>00*
- ClueWeb22-A: Document ids of the form clueweb22-<lang_id>01* - clueweb22-<lang_id>09*
- ClueWeb22-L: All document ids
lang_id is a language id such as "de" or "en".
Thus, clueweb22-en0003-18-00000 is in ClueWeb22-B, because it contains "en00". It is also in ClueWeb22-A and ClueWeb22-L, because larger subsets contain smaller subsets.
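The subset rules above can be expressed as a short function. This is a sketch under the assumption that the two digits immediately after the language id are the stream number; the function name subsets is hypothetical.

```python
import re

# Assumption: the id has the form clueweb22-<lang_id><2-digit stream><2-digit subdir>-<file>-<doc>.
_ID = re.compile(r"^clueweb22-[a-z_]+(\d{2})\d{2}-\d{2}-\d{5}$")


def subsets(doc_id):
    """Return the ClueWeb22 subsets that contain the document, smallest first."""
    match = _ID.match(doc_id)
    if match is None:
        raise ValueError(f"not a ClueWeb22 document id: {doc_id!r}")
    stream_no = int(match.group(1))
    if stream_no == 0:
        return ["B", "A", "L"]   # <lang_id>00*: in B, and therefore also in A and L
    if stream_no <= 9:
        return ["A", "L"]        # <lang_id>01* through <lang_id>09*
    return ["L"]                 # every other document id
```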
WARC (.html) File Format
HTML Web pages are stored in files that conform to the WARC ISO 28500 version 1.1 standard ("WARC files").
ClueWeb22 WARC files are stored using record-at-a-time compression (WARC 1.1 Annex D 12.2). Each WARC record is compressed independently using gzip, and the WARC file is a concatenation of the gzipped records. The entire WARC file can be uncompressed by gunzip, and single WARC records can also be extracted and uncompressed on their own (discussed below).
Each compressed WARC file requires up to about 1 GB of storage. An uncompressed file requires up to about 5 GB of storage.
The first WARC record is a 'warcinfo' record that describes the records that follow it. Its format is "header CRLF block CRLF CRLF". The header consists of a WARC version and seven warc-fields, as shown below. The WARC version and warc-fields are terminated by CRLF (CR is shown as ^M in the example below).
WARC/1.1^M
WARC-Type: warcinfo^M
WARC-Record-ID: <urn:uuid:8a19a16d-03a2-4a38-977a-f96bbcff7b55>^M
WARC-Filename: en0000-49.warc.gz^M
WARC-Date: 2022-06-29T02:48:18.503403Z^M
WARC-Block-Digest: sha1:37KCPH6UPZKBG73A63XAR4LGW3OFHHGO^M
Content-Type: application/warc-fields^M
Content-Length: 148^M
The block contains four lines, as shown below. Each line is terminated by CRLF (CR is shown as ^M in the example below).
software: warcio^M
format: WARC File Format 1.1^M
isPartOf: ClueWeb22^M
description: The Lemur Project's ClueWeb22 dataset (http://lemurproject.org/)^M
The remaining WARC records are 1-99,999 response records. The response record format is "header CRLF block CRLF CRLF". A header consists of a WARC version, the seven warc-fields shown above, and the nine custom warc-fields shown below. The WARC version and warc-fields are terminated by CRLF.
ClueWeb22-ID: The value is a ClueWeb22 document id, as defined above.
URL-Hash: A hash of the document's URL that may be convenient for use with efficient data structures.
Language: A language id code (e.g., de, en, es).
VDOM-Heading: A space-separated list of DOM tree nodes that are classified as containing headings.
VDOM-List: A space-separated list of DOM tree nodes that are classified as containing lists.
VDOM-Paragraph: A space-separated list of DOM tree nodes that are classified as containing passages.
VDOM-Primary: A space-separated list of DOM tree nodes that are classified as containing the primary document content.
VDOM-Table: A space-separated list of DOM tree nodes that are classified as containing tables.
VDOM-Title: A space-separated list of DOM tree nodes that are classified as containing titles.
The values for the VDOM-* warc-fields correspond to HTML tags with matching attribute values. It is not necessary to produce a DOM tree to use the VDOM-* warc-fields.
The response record content block is the HTML of a ClueWeb22 document.
Example files:
- WARC file: ClueWeb22/html/en/en00/en0003/en0003-18.warc.gz (793,667,816 bytes)
- WARC file checksum: ClueWeb22/html/en/en00/en0003/en0003-18.warc.gz.checksum (32 bytes)
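The "header CRLF block CRLF CRLF" layout described above can be parsed with the standard library alone. The following is a minimal sketch for a single uncompressed record, not a full WARC reader; the function name parse_warc_record is hypothetical. It relies on the Content-Length field to delimit the block.

```python
def parse_warc_record(raw):
    """Split one uncompressed WARC record into (version, warc-fields, block).

    raw is the bytes of a single record: the version line and warc-fields
    (each terminated by CRLF), an empty line, the block, then CRLF CRLF.
    """
    head, separator, rest = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line between header and block")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                      # e.g. WARC/1.1
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    length = int(fields["Content-Length"])  # block length in bytes
    return version, fields, rest[:length]
```

For a response record, fields["ClueWeb22-ID"] gives the document id, and the VDOM-* values can be split on spaces to obtain the list of annotated nodes.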
Clean Text (.txt) File Format
Text extracted from HTML Web pages is stored in a simple JSON file format.
The JSON file is a concatenation of 1-99,999 JSON records that each represent one document. Each JSON record (document) is contained on a single line and enclosed within curly braces ('{' and '}'). Each record contains the five fields shown below.
ClueWeb22-ID: The value is a ClueWeb22 document id, as defined above.
URL: The document's URL.
URL-hash: A hash of the document's URL that may be convenient for use with efficient data structures.
Language: A language id code (e.g., de, en, es).
Clean-Text: The text extracted from the HTML.
Clean text (.txt) files are compressed with gzip.
Example files:
- txt file: ClueWeb22/txt/en/en00/en0003/en0003-18.json.gz (51,095,362 bytes)
- txt file checksum: ClueWeb22/txt/en/en00/en0003/en0003-18.json.gz.checksum (32 bytes)
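A clean text file can be read sequentially with Python's gzip and json modules, since Python's gzip reader handles a concatenation of gzipped records transparently. This is a minimal sketch, assuming each record ends with a newline once uncompressed; the function name iter_clean_text is hypothetical.

```python
import gzip
import json


def iter_clean_text(path):
    """Yield one dict per document from a .json.gz clean text file."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():          # skip blank lines, if any
                yield json.loads(line)
```

Each yielded dict is expected to carry the fields listed above, so record["Clean-Text"] holds the extracted text.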
Outlink (.outlink) File Format
Text extracted from outlinks is stored in a simple JSON file format.
The JSON file is a concatenation of 1-99,999 JSON records that each represent one document. Each JSON record (document) is contained on a single line and enclosed within curly braces ('{' and '}'). Each record contains the fields shown below.
ClueWeb22-ID: The value is a ClueWeb22 document id, as defined above.
url: The document's URL.
urlhash: A hash of the document's URL that may be convenient for use with efficient data structures.
outlinks: [ [anchor1], ... ]
The i'th anchor record is a list of five values: [url, urlhash, anchor text, ?, language]
Outlink (.outlink) files are compressed with gzip.
Example files:
- outlink file: ClueWeb22/outlink/en/en00/en0003/en0003-18.json.gz (291,683,284 bytes)
Inlink (.inlink) File Format
Text extracted from inlinks is stored in a simple JSON file format.
The JSON file is a concatenation of 1-99,999 JSON records that each represent one document. Each JSON record (document) is contained on a single line and enclosed within curly braces ('{' and '}'). Each record contains the fields shown below.
ClueWeb22-ID: The value is a ClueWeb22 document id, as defined above.
url: The document's URL.
urlhash: A hash of the document's URL that may be convenient for use with efficient data structures.
anchors: [ [anchor1], ... ]
The i'th anchor record is a list of five values: [url, urlhash, anchor text, ?, language]
Inlink (.inlink) files are compressed with gzip.
Example files:
- inlink file: ClueWeb22/inlink/en/en00/en0003/en0003-18.json.gz (45,401,270 bytes)
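The anchor records in inlink and outlink files can be unpacked as shown in the sketch below. It assumes that every anchor record is a list of exactly five values in the order given above; the fourth value is not described in this document and is ignored here. The function name anchor_texts is hypothetical.

```python
import json


def anchor_texts(record_line, key="anchors"):
    """Return (url, anchor text, language) for each anchor in one JSON record.

    Use key="anchors" for inlink records and key="outlinks" for outlink records.
    """
    record = json.loads(record_line)
    result = []
    for anchor in record.get(key, []):
        # Assumed layout: [url, urlhash, anchor text, <undocumented>, language]
        url, _urlhash, text, _undocumented, language = anchor
        result.append((url, text, language))
    return result
```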
Visual Document Object Model (.vdom) File Format
Visual Document Object Models (.vdom) are stored in Protocol Buffers ("protobufs") that are stored in ZIP archives. A ZIP archive contains 1-99,999 protobufs. A protobuf stores the annotated document object model (DOM) of a single document. A document's protobuf filename is its ClueWeb22 document id with a ".bin" filetype. For example: clueweb22-en0003-18-00002.bin.
Protobufs are accessed using the open source ClueWeb22Api, a small software package written in Python. It is available from the Lemur Project's GitHub repository.
Example files:
- vdom file: ClueWeb22/vdom/en/en00/en0003/en0003-18.zip (404,537,251 bytes)
- vdom file checksum: ClueWeb22/vdom/en/en00/en0003/en0003-18.zip.checksum (32 bytes)
Screenshot (.jpg) File Format
Rendered versions of each page ("screenshots") are stored as JPEG images in ZIP archives. A ZIP archive contains 1-99,999 JPEG files that each represent one document rendered by a browser with 1024 horizontal pixels and an infinite number of vertical pixels. A document's filename is its ClueWeb22 document id with a ".jpg" filetype. For example: clueweb22-en0003-18-00002.jpg.
Most pages have screenshots, but some do not.
Example files:
- jpg file: ClueWeb22/jpg/en/en00/en0003/en0003-18.zip (5,172,801 bytes)
Note: This sample is just the first ten images of en0003-18.zip, due to the size of the complete file.
Compression
Most of the ClueWeb22 file formats are compressed by gzip (.gz). A few file formats are compressed by zip (.zip). Both forms of compression are described below.
Gzip (.gz)
Files with .gz extensions are stored using record-at-a-time gzip compression. Each document is compressed independently. The .gz file is a concatenation of gzipped records, which supports two types of uncompression:
- The entire file can be uncompressed by one gzip command, and
- A single document can be extracted and uncompressed.
Each .gz file has an accompanying .offset file (e.g., en0003-18.warc.offset) that supports random access to documents for reranking or other tasks. Each line in an offset file contains a byte offset from the start of the compressed file to the start of a compressed document. The n'th line contains the byte offset of the n'th compressed document. The last byte offset is the position one byte beyond the end of the file. Its only purpose is to enable software to find the length of the last compressed document.
Each line contains a 10-digit byte offset terminated by a newline character (11 characters) to support direct access to the line for a specific document.
For example, the first six lines of en0003-18.warc.offset are shown below.
0000000255
0000009393
0000037586
0000056405
0000060308
0000073201
"0000000255" indicates that the compressed WARC response record for document en0003-18-00000 (the 0'th document) begins 255 bytes from the beginning of the compressed WARC file.
"0000073201" indicates that the compressed WARC response record for document en0003-18-00005 (the 5'th document) begins 73201 bytes from the beginning of the compressed WARC file.
All .gz files have a corresponding .offset file to support random access and extraction of documents. After the compressed bytes are extracted, they may be uncompressed using your favorite gzip-compatible compression library (e.g., Python's Lib/gzip.py).
Example files:
- WARC file: ClueWeb22/html/en/en00/en0003/en0003-18.warc.gz (793,667,816 bytes)
- WARC file checksum: ClueWeb22/html/en/en00/en0003/en0003-18.warc.gz.checksum (32 bytes)
- WARC offset file: ClueWeb22/html/en/en00/en0003/en0003-18.warc.offset (221,056 bytes)
- WARC offset file checksum: ClueWeb22/html/en/en00/en0003/en0003-18.warc.offset.checksum (32 bytes)
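Random access with an offset file can be sketched as follows. Because every line is exactly 11 bytes, the n'th offset can be read directly by seeking to byte 11 * n; the next line gives the end of the compressed document. This is a minimal sketch under the layout described above, and the function name read_record is hypothetical.

```python
import gzip

LINE_WIDTH = 11  # 10-digit byte offset plus a newline character


def read_record(gz_path, offset_path, n):
    """Return the uncompressed bytes of the n'th document (counting from 0)."""
    with open(offset_path, "rb") as offsets:
        offsets.seek(n * LINE_WIDTH)
        # Read the start of document n and the start of the next entry.
        pair = offsets.read(2 * LINE_WIDTH).split()
    if len(pair) < 2:
        raise IndexError(f"no document {n} in {offset_path}")
    start, end = int(pair[0]), int(pair[1])
    with open(gz_path, "rb") as data:
        data.seek(start)
        compressed = data.read(end - start)
    return gzip.decompress(compressed)
```

For a WARC file the result is one uncompressed response record; for a .json.gz file it would be one JSON record.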
Zip (.zip)
Files with .zip extensions are ZIP archives. A ZIP archive contains one or more files that may be compressed, and a directory that identifies the location of each file in the archive. [1] In ClueWeb22 ZIP archives, a file is a record that describes a ClueWeb22 document. Its filename is the ClueWeb22 document id and its filetype indicates the file format. For example: clueweb22-en0003-18-00002.jpg (screenshot) or clueweb22-en0003-18-00002.bin (vdom).
Some ZIP software libraries support the ability to extract just a single file from the archive (without uncompressing the entire archive), thereby supporting random access to documents for reranking or other tasks.
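Python's zipfile module is one library that can extract a single member without uncompressing the whole archive. The sketch below assumes the member name is the document id plus the filetype, possibly preceded by a directory prefix inside the archive (the prefix is an assumption, handled defensively). The function name read_member is hypothetical.

```python
import zipfile


def read_member(zip_path, doc_id, filetype):
    """Return the bytes of one document (e.g. filetype "jpg" or "bin") from a ZIP archive."""
    wanted = f"{doc_id}.{filetype}"
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Accept the bare filename or the same filename under a directory.
            if name == wanted or name.endswith("/" + wanted):
                return archive.read(name)
    raise KeyError(wanted)
```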
Checksum Files
All data files have a corresponding checksum file that contains the MD5 sum of the data file. Each checksum file contains a single, 32-character string that is an MD5 checksum.
Example files:
- WARC file: ClueWeb22/html/en/en00/en0003/en0003-18.warc.gz (793,667,816 bytes)
- WARC file checksum: ClueWeb22/html/en/en00/en0003/en0003-18.warc.gz.checksum (32 bytes)
- WARC offset file: ClueWeb22/html/en/en00/en0003/en0003-18.warc.offset (221,056 bytes)
- WARC offset file checksum: ClueWeb22/html/en/en00/en0003/en0003-18.warc.offset.checksum (32 bytes)
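A data file can be verified against its checksum file with Python's hashlib module. This is a minimal sketch; the function name md5_matches is hypothetical, and the file is read in chunks because data files can be close to 1 GB.

```python
import hashlib


def md5_matches(data_path, checksum_path, chunk_size=1 << 20):
    """Return True if the MD5 of data_path equals the string stored in checksum_path."""
    digest = hashlib.md5()
    with open(data_path, "rb") as data:
        # Read in chunks so that large files are never held fully in memory.
        for chunk in iter(lambda: data.read(chunk_size), b""):
            digest.update(chunk)
    with open(checksum_path, "r", encoding="ascii") as f:
        expected = f.read().strip().lower()
    return digest.hexdigest() == expected
```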
Record Count Files
The dataset includes files that indicate the number of document records in each {html, outlink, inlink, txt, vdom, jpg} file. Record count files are in comma separated values (.csv) format, and organized by stream id. They have two fields.
- filename: For example, en0003-18.warc.gz.
- number of records: For example, 20095.
In the ClueWeb22-B subset, the html, txt, and vdom formats all have the same record counts, because each document has html, txt, and vdom formats. The outlink, inlink, and jpg record counts may differ because some documents do not have outlinks, inlinks, and/or screen shots. This is also true for the ClueWeb22-A subset.
Example file:
- WARC record counts file: ClueWeb22/record_counts/html/en00_counts.csv (112,368 bytes)
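A record count file can be loaded with Python's csv module. This sketch assumes the two fields appear in the order listed above (filename, then number of records). Whether the files begin with a header row is not stated here, so rows whose second field is not an integer are skipped. The function name load_record_counts is hypothetical.

```python
import csv


def load_record_counts(csv_path):
    """Return a dict that maps each filename to its number of records."""
    counts = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue
            try:
                counts[row[0].strip()] = int(row[1])
            except ValueError:
                continue  # likely a header row; skip it
    return counts
```

Summing the values of the returned dict gives the number of documents in the stream for that format.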
Summary Statistics
Sizes are given in terabytes (T).
Subset | Documents | HTML (T) | outlink (T) | inlink (T) | txt (T) | vdom (T) | jpg (T) |
---|---|---|---|---|---|---|---|
B | 200,000,000 | 6.8 | 0.5 | 0.4 | 0.4 | 3.5 | 80.5 |
A | 2,000,000,000 | n/a | | | | | |
L | 10,000,000,000 | n/a | 8.8 | 2.3 | 20.7 | n/a | n/a |
Dataset Versions (Change Log)
As errors and problems are discovered, the dataset is corrected or improved, which produces slightly different versions of the dataset. The dataset version is indicated by an empty file with the name "version_<subset>_<version_id>" in the root directory. The subset is either B, A, or L. For example, version_L_01.00.
02.01 (May 24, 2023):
An error in the txt/de/ directory structure was corrected. This problem occurs on ClueWeb22_A Disk 6 and ClueWeb22_L Disk 2.
- chmod g+rwx,o+rx txt/de/de01
- delete everything in txt/de/de01 **except** de1
- mv txt/de/de01/de1/* txt/de/de01/
- rmdir txt/de/de01/de1
Unnecessary checksum files were deleted. This problem occurs on ClueWeb22_B.
- rm html/en/en00/en00??/en00??-??.checksum.checksum
- rm html/fr/fr00/fr000?/fr000?-??.warc.gz.checksum.checksum
- rm html/fr/fr00/fr000?/fr000?-??.warc.gz.checksum.checksum.checksum
- rm html/fr/fr00/fr000?/fr000?-??.warc.offset.checksum.checksum
- rm html/fr/fr00/fr000?/fr000?-??.warc.offset.checksum.checksum.checksum
- rm html/ja/ja00/ja0009/ja0009-57.warc.offset~*
Unnecessary checksum files were deleted. This problem occurs on ClueWeb22_B and ClueWeb22_L Disk 2.
- rm txt/de/de00/de0000/de0000-*.checksum.checksum
- rm txt/de/de00/de0003/de0003-*.checksum.checksum
A missing version number file was added to disks that were missing it. This problem occurs on ClueWeb22_B (full), ClueWeb22_A_4, and ClueWeb22_A_5.
- touch version_A_02.01
- touch version_B_02.01
02.00 (March 22, 2023): inlink and outlink data was replaced in B and A; and added to L.
01.00: The original version of the dataset.