Dataset Information
Dataset Organization
The dataset is organized into segments. Each segment contains approximately 50 million records (web pages). Each segment is stored in a directory named:
- ClueWeb09_<language>_<segment #>
where <language> is the language of the pages in the segment (e.g. English) and <segment #> is the segment number.
Each segment contains a set of directories named:
- <language><directory #>
where <language> is a 2-letter standard language identifier (see Language Identifiers below), and <directory #> is the sequence number for that language.
Each directory contains up to 100 files named:
- <file #>.warc.gz
where <file #> is the sequence number of the file within its directory from "00.warc.gz" up to "99.warc.gz".
Each file contains approximately 40,000 web pages in WARC file format, as described below. An uncompressed file requires about 1 GB of storage.
For example, the first English pages downloaded by the crawler are stored in:
- ClueWeb09_English_1/en0000/00.warc.gz
There is one exception to this format. At the end of the first English segment (ClueWeb09_English_1), there are four directories that contain a complete copy of the English Wikipedia. These directories are named:
- enwp<wikipedia directory #>
where <wikipedia directory #> is 00, 01, 02, or 03.
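The naming scheme above can be sketched as a small path builder. This is a minimal illustration, not part of the dataset tooling: the function name warc_path and its parameters are hypothetical, the 4-digit zero padding of the directory number follows the en0000 example, and the Wikipedia directories (enwp00 to enwp03) are not handled.

```python
def warc_path(language_name, segment, lang_code, directory, file_no):
    """Build the relative path of one WARC file from its components.

    language_name: full language name used in the segment name, e.g. "English"
    segment:       segment number, e.g. 1
    lang_code:     2-letter language identifier, e.g. "en"
    directory:     directory sequence number (padded to 4 digits)
    file_no:       file sequence number within the directory, 0..99
    """
    if not 0 <= file_no <= 99:
        raise ValueError("file number must be between 0 and 99")
    segment_dir = f"ClueWeb09_{language_name}_{segment}"
    sub_dir = f"{lang_code}{directory:04d}"
    return f"{segment_dir}/{sub_dir}/{file_no:02d}.warc.gz"


# The first English file mentioned in the text.
print(warc_path("English", 1, "en", 0, 0))
```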
Dataset Format
Web pages are stored in gzipped files in WARC format. The WARC formatting used conforms to the WARC ISO 28500 final draft (as of June 18th, 2008), version 018. Specifications for the format can be found at:
- http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
- http://www.scribd.com/doc/4303719/WARC-ISO-28500-final-draft-v018-Zentveld-080618
One custom field, named "WARC-TREC-ID", is added to the WARC response header information. It is a globally unique identifier within the dataset that describes the location of the individual record within the entire ClueWeb09 Dataset. The WARC-TREC-ID value is in the format of:
- clueweb09-<directory>-<file>-<record>
The <directory> corresponds to the individual directory as specified in the Dataset Organization section above. It is in the format <language><directory #>, where <language> is the 2-letter standard language code and <directory #> is a 4-digit (zero-padded) directory number in sequence.
The <file> is a 2-digit (padded) number that corresponds to the file number within the <directory>.
The <record> is a 5-digit (padded) number that corresponds to this record's sequence within the individual file.
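As a sketch of how the identifier maps back to a file, the following splits a WARC-TREC-ID into its three parts and derives the file's path relative to its segment directory. The names TrecId, parse_trec_id, and relative_path are hypothetical, the sample identifier is made up for illustration, and only the <language><directory #> form described above is accepted; identifiers for the Wikipedia directories are not covered by this description and are rejected here.

```python
import re
from typing import NamedTuple


class TrecId(NamedTuple):
    directory: str  # e.g. "en0042"
    file: str       # 2-digit file number, e.g. "15"
    record: str     # 5-digit record number, e.g. "00123"


# clueweb09-<2-letter code + 4 digits>-<2 digits>-<5 digits>
_TREC_ID = re.compile(r"clueweb09-([a-z]{2}\d{4})-(\d{2})-(\d{5})")


def parse_trec_id(value):
    """Split a WARC-TREC-ID into directory, file, and record parts."""
    match = _TREC_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a recognised WARC-TREC-ID: {value!r}")
    return TrecId(*match.groups())


def relative_path(trec_id):
    """Path of the WARC file holding the record, relative to its segment."""
    return f"{trec_id.directory}/{trec_id.file}.warc.gz"


# Illustrative identifier, not taken from the dataset.
parsed = parse_trec_id("clueweb09-en0042-15-00123")
print(parsed.directory, parsed.file, parsed.record, relative_path(parsed))
```

The parts are kept as strings so that the zero padding survives a round trip; convert with int() where a numeric value is needed.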
The WARC-Date field:
This field contains the date that the crawled document was formatted into a WARC record. Most of the formatting occurred in March 2009, although some records (primarily Wikipedia) were formatted in February 2009. The date format is 2009-mm-ddThh:mm:ssZ, where the first mm is a two-digit integer indicating the month; dd is a two-digit integer indicating the day of the year (not the day of the month); T (for time) is a character separating the date from the time; hh is a two-digit integer indicating the hour; the second mm is a two-digit integer indicating the minute; ss is a two-digit integer indicating the second; and Z is a hyphen and four-digit number (-0500) indicating the time zone.
For most purposes the date that the document was crawled is probably more useful than the WARC-Date. The date that the document was crawled can be found in the HTTP header.
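To list the WARC-TREC-ID values stored in one file, a simple line scan over the decompressed stream is often enough. The sketch below assumes each header field sits on its own line beginning with "WARC-TREC-ID:"; the function name iter_trec_ids is hypothetical. It is not a full WARC parser, so a page body that happens to contain a line with the same prefix would also be reported.

```python
import gzip


def iter_trec_ids(path):
    """Yield each WARC-TREC-ID value found in a gzipped WARC file."""
    prefix = b"WARC-TREC-ID:"
    # gzip.open decompresses on the fly, so the roughly 1 GB of
    # uncompressed data is never held in memory at once.
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if line.startswith(prefix):
                yield line[len(prefix):].strip().decode("ascii")
```

A typical use would be `for trec_id in iter_trec_ids("en0000/00.warc.gz"): ...`, with the path adjusted to wherever the segment is mounted.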
Checksum Files
The files with the name "ClueWeb_*.md5" contain the MD5 sums of the individual WARC files in the dataset. These MD5 sums are in the format:
- <md5 checksum hash> <file>
with multiple lines in the file, one line for each file in the dataset. For example, the following line:
- 98f91370de2dbc9c6d358f6251e591d6 *./en0000/22.warc.gz
denotes the md5 checksum for the file 22.warc.gz under the en0000 directory.
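A minimal sketch of checking one file against its checksum line, using only the Python standard library. The function names are hypothetical. The leading "*" before the file name is the binary-mode marker written by the md5sum tool, and it is stripped before the path is resolved against the segment directory.

```python
import hashlib
from pathlib import Path


def parse_md5_line(line):
    """Split '<md5> *<file>' into (digest, relative file name)."""
    digest, name = line.split(maxsplit=1)
    return digest.lower(), name.strip().lstrip("*")


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify(line, segment_root):
    """Return True if the file named on the line matches its checksum."""
    expected, relative_name = parse_md5_line(line)
    return md5_of(Path(segment_root) / relative_name) == expected


print(parse_md5_line("98f91370de2dbc9c6d358f6251e591d6 *./en0000/22.warc.gz"))
```

Where the md5sum tool is available, running `md5sum -c` on the checksum file from inside the segment directory performs the same check for every listed file.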
The checksum files (by segment) are in the individual directories on the disks, or can be downloaded here:
* ClueWeb09_English_1 checksums
* ClueWeb09_English_2 checksums
* ClueWeb09_English_3 checksums
* ClueWeb09_English_4 checksums
* ClueWeb09_English_5 checksums
* ClueWeb09_English_6 checksums
* ClueWeb09_English_7 checksums
* ClueWeb09_English_8 checksums
* ClueWeb09_English_9 checksums
* ClueWeb09_English_10 checksums
* ClueWeb09_Chinese_1 checksums
* ClueWeb09_Chinese_2 checksums
* ClueWeb09_Chinese_3 checksums
* ClueWeb09_Chinese_4 checksums
* ClueWeb09_Spanish_1 checksums
* ClueWeb09_Spanish_2 checksums
* ClueWeb09_Japanese_1 checksums
* ClueWeb09_Japanese_2 checksums
* ClueWeb09_French_1 checksums
* ClueWeb09_German_1 checksums
* ClueWeb09_Italian_1 checksums
* ClueWeb09_Korean_1 checksums
* ClueWeb09_Portuguese_1 checksums
* ClueWeb09_Arabic_1 checksums
Record Counts
The files with the name "ClueWeb09_*_counts.txt" contain the record counts by file for the individual WARC files in the dataset. The record count files are in the format of:
- <file> <# of records>
with multiple lines in the file, one line for each file in the dataset. For example, the following line:
- *./en0042/15.warc.gz 34618
denotes that the file 15.warc.gz under the en0042 directory has 34,618 individual page records in it.
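The count files can be read with a few lines of Python. This is a sketch under the assumption that every non-blank line has the shape shown above, with the count as the last whitespace-separated token; parse_count_line and total_records are hypothetical names.

```python
def parse_count_line(line):
    """Split '<file> <# of records>' into (file name, record count)."""
    name, count = line.rsplit(maxsplit=1)
    return name.lstrip("*"), int(count)


def total_records(lines):
    """Sum the record counts over all non-blank lines of a count file."""
    return sum(parse_count_line(line)[1] for line in lines if line.strip())


print(parse_count_line("*./en0042/15.warc.gz 34618"))
```

Passing an open file object, for example `total_records(open("ClueWeb09_English_1_counts.txt"))`, yields a per-segment total that can be compared with the segment table.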
The record counts (by language) are as follows:
Language | # Records |
English | 503,903,810 pages |
Chinese | 177,489,357 pages |
Spanish | 79,333,950 pages |
Japanese | 67,337,717 pages |
French | 50,883,172 pages |
German | 49,814,309 pages |
Portuguese | 37,578,858 pages |
Arabic | 29,192,662 pages |
Italian | 27,250,729 pages |
Korean | 18,075,141 pages |
The record counts (by segment on disk) are as follows (note that the individual count files are on the disks, but can also be downloaded here):
Segment Identifier | # Records | Record Count File |
ClueWeb09_English_1 | 50,220,423 pages | ClueWeb09_English_1_counts.txt |
ClueWeb09_English_2 | 51,577,077 pages | ClueWeb09_English_2_counts.txt |
ClueWeb09_English_3 | 50,547,493 pages | ClueWeb09_English_3_counts.txt |
ClueWeb09_English_4 | 52,311,060 pages | ClueWeb09_English_4_counts.txt |
ClueWeb09_English_5 | 50,756,858 pages | ClueWeb09_English_5_counts.txt |
ClueWeb09_English_6 | 50,559,093 pages | ClueWeb09_English_6_counts.txt |
ClueWeb09_English_7 | 52,472,358 pages | ClueWeb09_English_7_counts.txt |
ClueWeb09_English_8 | 49,545,346 pages | ClueWeb09_English_8_counts.txt |
ClueWeb09_English_9 | 50,738,874 pages | ClueWeb09_English_9_counts.txt |
ClueWeb09_English_10 | 45,175,228 pages | ClueWeb09_English_10_counts.txt |
ClueWeb09_Chinese_1 | 50,325,079 pages | ClueWeb09_Chinese_1_counts.txt |
ClueWeb09_Chinese_2 | 49,764,419 pages | ClueWeb09_Chinese_2_counts.txt |
ClueWeb09_Chinese_3 | 50,359,421 pages | ClueWeb09_Chinese_3_counts.txt |
ClueWeb09_Chinese_4 | 27,040,438 pages | ClueWeb09_Chinese_4_counts.txt |
ClueWeb09_Spanish_1 | 49,841,221 pages | ClueWeb09_Spanish_1_counts.txt |
ClueWeb09_Spanish_2 | 29,492,729 pages | ClueWeb09_Spanish_2_counts.txt |
ClueWeb09_Japanese_1 | 50,634,640 pages | ClueWeb09_Japanese_1_counts.txt |
ClueWeb09_Japanese_2 | 16,703,077 pages | ClueWeb09_Japanese_2_counts.txt |
ClueWeb09_French_1 | 50,883,172 pages | ClueWeb09_French_1_counts.txt |
ClueWeb09_German_1 | 49,814,309 pages | ClueWeb09_German_1_counts.txt |
ClueWeb09_Italian_1 | 27,250,729 pages | ClueWeb09_Italian_1_counts.txt |
ClueWeb09_Korean_1 | 18,075,141 pages | ClueWeb09_Korean_1_counts.txt |
ClueWeb09_Portuguese_1 | 37,578,858 pages | ClueWeb09_Portuguese_1_counts.txt |
ClueWeb09_Arabic_1 | 29,192,662 pages | ClueWeb09_Arabic_1_counts.txt |
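As a consistency check, the per-segment counts can be grouped by language and compared with the per-language totals. The sketch below copies the figures from the two tables; the variable and function names are hypothetical.

```python
from collections import defaultdict

# Record counts per segment, copied from the segment table.
SEGMENT_COUNTS = {
    "ClueWeb09_English_1": 50_220_423,
    "ClueWeb09_English_2": 51_577_077,
    "ClueWeb09_English_3": 50_547_493,
    "ClueWeb09_English_4": 52_311_060,
    "ClueWeb09_English_5": 50_756_858,
    "ClueWeb09_English_6": 50_559_093,
    "ClueWeb09_English_7": 52_472_358,
    "ClueWeb09_English_8": 49_545_346,
    "ClueWeb09_English_9": 50_738_874,
    "ClueWeb09_English_10": 45_175_228,
    "ClueWeb09_Chinese_1": 50_325_079,
    "ClueWeb09_Chinese_2": 49_764_419,
    "ClueWeb09_Chinese_3": 50_359_421,
    "ClueWeb09_Chinese_4": 27_040_438,
    "ClueWeb09_Spanish_1": 49_841_221,
    "ClueWeb09_Spanish_2": 29_492_729,
    "ClueWeb09_Japanese_1": 50_634_640,
    "ClueWeb09_Japanese_2": 16_703_077,
    "ClueWeb09_French_1": 50_883_172,
    "ClueWeb09_German_1": 49_814_309,
    "ClueWeb09_Italian_1": 27_250_729,
    "ClueWeb09_Korean_1": 18_075_141,
    "ClueWeb09_Portuguese_1": 37_578_858,
    "ClueWeb09_Arabic_1": 29_192_662,
}

# Record counts per language, copied from the language table.
LANGUAGE_TOTALS = {
    "English": 503_903_810,
    "Chinese": 177_489_357,
    "Spanish": 79_333_950,
    "Japanese": 67_337_717,
    "French": 50_883_172,
    "German": 49_814_309,
    "Portuguese": 37_578_858,
    "Arabic": 29_192_662,
    "Italian": 27_250_729,
    "Korean": 18_075_141,
}


def totals_by_language(segment_counts):
    """Sum segment counts per language name taken from the segment identifier."""
    totals = defaultdict(int)
    for segment, count in segment_counts.items():
        # "ClueWeb09_English_10" -> "English"
        language = segment.split("_")[1]
        totals[language] += count
    return dict(totals)


computed = totals_by_language(SEGMENT_COUNTS)
mismatches = {
    language: (total, LANGUAGE_TOTALS[language])
    for language, total in computed.items()
    if total != LANGUAGE_TOTALS[language]
}
print(mismatches or "segment counts agree with the per-language totals")
```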
Language Identifiers
All 2-letter language identifiers for the dataset conform to the ISO 639 language ID list. The languages used in the ClueWeb09 dataset are:
- en - English
- zh - Chinese
- es - Spanish
- ja - Japanese
- de - German
- fr - French
- ko - Korean
- it - Italian
- pt - Portuguese
- ar - Arabic
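For scripts that need to go from a directory name back to a language, the list above can be kept as a lookup table. This is a small sketch; LANGUAGES and language_of are hypothetical names, and the function assumes the directory name starts with one of the ten codes.

```python
# 2-letter identifier -> language name, as listed above.
LANGUAGES = {
    "en": "English",
    "zh": "Chinese",
    "es": "Spanish",
    "ja": "Japanese",
    "de": "German",
    "fr": "French",
    "ko": "Korean",
    "it": "Italian",
    "pt": "Portuguese",
    "ar": "Arabic",
}


def language_of(directory):
    """Return the language name for a directory such as 'en0000'."""
    code = directory[:2]
    if code not in LANGUAGES:
        raise KeyError(f"unknown language identifier: {code!r}")
    return LANGUAGES[code]


print(language_of("en0000"))
```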