The ClueWeb12 Dataset:
Frequently Asked Questions


We have separated the questions into two sections. The first section has answers to questions related to obtaining the dataset. The second has answers to questions regarding installing the dataset.


Questions about obtaining the dataset


Why is the dataset named ClueWeb12?
Why is the dataset so expensive?
What is the "Category B" subset?
Will you consider modifications to the license?
What if my organization's lawyers insist on modifying the license?
How was the dataset created?


Why is the dataset named ClueWeb12? The U.S. National Science Foundation's Cluster Exploratory (CluE) program provided computational resources and funding that enabled creation of ClueWeb09, the predecessor to the ClueWeb12 dataset. Although the funding for ClueWeb12 came from a different NSF program, we kept the same name for continuity.

Why is the dataset so expensive? Most of the cost of each dataset covers the hard disk drive(s) used to ship the data to you; the drives are yours to keep. The remainder covers the staff time required to process dataset licenses, process invoices, buy disks, copy disks, buy packing materials, and prepare disks for shipping, plus a small fee that helps us maintain the hardware used for duplicating disks.

What is the "Category B" subset? The "Category B" dataset is a 5% sample of the full dataset (50 million documents). It continues a tradition, started by NIST's annual TREC information retrieval evaluations, of providing a smaller subset of the data to support research groups that are unable to work with the full dataset.

Will you consider modifications to the license? Our license is a slight modification of 'TREC style' licenses that have been used by other organizations for more than a decade to distribute web datasets. It is fairly well-established. The cost of the dataset is kept low in part by not involving university lawyers and senior university administrators any more often than absolutely necessary. Please don't ask us to modify the license.

What if my organization's lawyers insist on modifying the license? We will consider whether your request fixes a flaw that applies to a significant group of organizations. If it does, we will try to resolve the issue fairly quickly. If it does not, we will probably refuse the request. Nearly all of the requests that we receive are minor adjustments to wording or attempts to make the license more favorable to the other organization. We reject those requests.

How was the dataset created? The ClueWeb12 dataset was created by Jamie Callan's research group at Carnegie Mellon University's Language Technologies Institute. The web crawl was done from February 10, 2012 until May 10, 2012.




Questions about installing the dataset


What should we check to confirm that the 3TB disk sets are compatible with our operating system and hardware?
I am having problems mounting the dataset disks. What should I do?
What operating system, utilities and software were used to create the dataset disks?
I am having problems getting the data off the shipped disk(s). How can I check the condition of the hard disk to determine if it is faulty?
Can I download the dataset over the internet?


What should we check to confirm that the 3TB disk sets are compatible with our operating system and hardware? If the motherboard supports 48-bit Logical Block Addressing (LBA), then your hardware is compatible; if your computer's motherboard was made in the last 10 years, it likely has this support. The BIOS and drive controllers must support partitions larger than 2.2TB.

32-bit LBAs impose a limit of 2.2TB of addressable storage. Once you have confirmed that the hardware is compatible, the software also needs to be checked: recent 64-bit versions of Windows and OS X have moved away from the 32-bit LBA method of addressing data on disks.

Any hardware that you connect to your computer (such as external USB, eSATA, or FireWire/IEEE-1394 disk enclosures) must have a driver installed that is supported by and compatible with your OS, and the software driver and device firmware must support drives larger than 2.2TB. Devices more than a couple of years old are likely NOT compatible with disk drives larger than 2.2TB; check with the manufacturer to confirm that your device supports the larger drives.
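
On Linux, a quick way to confirm that the whole chain (controller, driver, and operating system) handles large drives is to check the capacity that the kernel reports for the disk. A minimal check, assuming the drive appears as /dev/sdb as in the formatting example below:

# blockdev --getsize64 /dev/sdb
3000592982016

If a 3TB drive is reported with a much smaller capacity (for example, about 2.2TB), some component in the chain is still limited to 32-bit LBA.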

Hitachi's website provides a white paper in PDF format, High Capacity (~2.2TB) Technology Brief, which includes a table summarizing which operating systems support high-capacity drives.

I am having problems mounting the dataset disks. What should I do? Here are a couple of things you can try before contacting us:

  1. If you have a multi-disk set, try mounting a different disk. Although we do our best to ship good disks, we have seen drives fail after they leave our facility. If you have difficulty mounting two different drives, the problem is probably not a faulty disk.
  2. Try connecting the disk to the internal SATA bus. This eliminates external hardware, drivers, and firmware as potential points of failure. Be sure that the SATA port you connect the disk to is enabled in the BIOS. If neither step helps, the diagnostic commands below may narrow down the cause.
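
These commands confirm that the kernel detects the drive and its partition table; they assume the drive shows up as /dev/sdb:

# dmesg | tail -n 20
(look for SATA link errors or I/O errors logged when the drive was attached)
# parted /dev/sdb print
(shows the drive and its GPT partition table; note that older versions of fdisk do not understand GPT)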

What operating system, utilities and software were used to create the dataset disks? The operating system used was Linux Fedora release 7 (Moonshine).

The Linux command-line tools used are:
# fdisk -v
fdisk (util-linux 2.13-pre7)
# parted --version
parted (GNU parted) 1.8.6
mke2fs 1.42 (29-Nov-2011)

As an example, here is the command sequence that we used to format a 3TB disk that was recognized as /dev/sdb on the system (you need root access or sudo privileges to perform these operations):

# fdisk -l
.
.
.
Disk /dev/sdb: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

# parted /dev/sdb
(parted) mklabel gpt
(parted) unit TB
(parted) mkpart primary 0.00TB 3.00TB
(parted) print
Model: ST3000DM 001-1CH166 (scsi)
Disk /dev/sdb: 3.00TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 0.00TB 3.00TB 3.00TB primary

(parted) quit
# fdisk -l
.
.
.
WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdb1 1 4294967295 2147483647+ ee GPT

# mkfs -t ext3 -m 0 -T largefile4 /dev/sdb1
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
715424 inodes, 732566272 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
22357 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
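
For reference: in the mkfs command above, -m 0 sets the percentage of blocks reserved for the super-user to zero, and -T largefile4 selects a filesystem usage type from /etc/mke2fs.conf that allocates roughly one inode per 4MiB. Both choices suit a volume holding a modest number of very large dataset files.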


Commands for mounting the volume:

# mkdir /data
# mount /dev/sdb1 /data
# df -h

Additional information can be found here: Linux Creating a Partition Size Larger Than 2TB
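
If you want the volume mounted automatically at boot, you can also add an entry to /etc/fstab. A minimal sketch, assuming the partition is /dev/sdb1 and the mount point is /data as above (we mount read-only because the dataset disks never need to be written):

/dev/sdb1  /data  ext3  defaults,ro  0  2

Using the partition's UUID (shown by "# blkid /dev/sdb1") instead of the device name makes the entry robust if device names change between reboots.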



I am having problems getting the data off the shipped disk(s). How can I check the condition of the hard disk to determine if it is faulty? Here are a few things you can try before contacting us:

The drives we ship have a Self-Monitoring, Analysis, and Reporting Technology (SMART) system. SMART hard disks internally monitor their own health and performance, and most SMART systems allow users to perform self-tests. The tools for using the SMART system vary by operating system; for example, Ubuntu's "Disk Utility" displays the hard disk's overall health assessment and whether the SMART system is enabled. You can also use the Linux command-line tool:

# smartctl -a /dev/sdb
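
smartctl can also start a self-test and report the results. A typical sequence, assuming the drive is /dev/sdb (a short test usually finishes within a few minutes):

# smartctl -H /dev/sdb
(prints the overall health assessment)
# smartctl -t short /dev/sdb
(starts a short self-test)
# smartctl -l selftest /dev/sdb
(shows the self-test log once the test completes)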

You may find this article helpful: The Beginner's Guide to Linux Disk Utilities


Can I download the dataset over the internet? We are sorry, but we do not distribute the data over the network; even the Category B (ClueWeb12-B13) subset is too large for that. We distribute the dataset to many organizations, and Carnegie Mellon's network administrators would not appreciate us consuming that much bandwidth.




You are welcome to contact David Pane at callan@cs.cmu.edu with any additional questions or for help with problems with the dataset disks. To help David resolve your issue quickly, please describe your problem in as much detail as possible.