Web Graph
Here are the web graphs for the entire ClueWeb09 dataset and for the TREC 2009 Category B subset (the first 50 million English pages). If they are not included on the hard disks sent for dataset distribution, they can be downloaded from the links provided here.
The statistics for the web graph are as follows:
- Full Dataset:
  - Unique URLs: 4,780,950,903 (325 GB uncompressed, 116 GB compressed)
  - Total Outlinks: 7,944,351,835 (71 GB uncompressed, 29 GB compressed)
- TREC Category B (first 50 million English pages):
  - Unique URLs: 428,136,613 (30 GB uncompressed, 9.7 GB compressed)
  - Total Outlinks: 454,075,638 (3.8 GB uncompressed, 1.6 GB compressed)
Webgraph Format
Full Dataset
The webgraph for the full dataset consists of the following files:
- ClueWeb09_WG_NodeList_Full.txt.gz: The list of unique URLs that correspond to the nodes in the web graph. This is the full list in node ID order. The first line is the URL for node 0, the next line is the URL for node 1, etc. (Size of the .gz file: 124,216,628,483 bytes, MD5: 364e464b135d38b6f0074e7c7e5d79e3).
- ClueWeb09_WG_all.graph-txt.gz: The ASCII webgraph of the outlinks for the nodes. The first line in the file is the total number of nodes (4,780,950,903 nodes). It is followed by one line of outlinks (target nodes) for each node ID, starting with node 0. Target node IDs are space-separated on the line for that node. If a node has no outlinks, its line is blank. Some of the node IDs may be larger than 4,780,950,902, but not much larger. This is because more than 1.2 billion pages were crawled, but for quality-control reasons not all of them were output as part of the ClueWeb collection. NOTE: Sebastiano Vigna from the University of Milano reported that the first line of the graph file should be 4780950911, not 4780950903. Correcting this will enable the graph to be digested by ASCIIGraph without problems.
- Unfortunately, we cannot at this time offer a BVGraph-formatted version of the full webgraph for use with the WebGraph Toolkit, because WebGraph uses 32-bit numbers, which limits the package to about 2.1 billion nodes. This may change in the future with a possible 64-bit version.
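The ASCII graph layout described above (a node count on the first line, then one line of space-separated target IDs per node) can be read as a stream. The following is a minimal sketch, not an official tool: the function names `parse_graph_txt` and `open_graph_txt` are hypothetical, and the in-memory sample is made up for illustration. The declared node count is returned but not enforced, since target IDs may exceed it.

```python
import gzip
import io
from typing import Iterator, List, TextIO, Tuple


def parse_graph_txt(stream: TextIO) -> Tuple[int, Iterator[Tuple[int, List[int]]]]:
    """Return (declared_node_count, iterator of (node_id, target_ids))."""
    # The first line holds the total number of nodes.
    declared_nodes = int(stream.readline().strip())

    def rows() -> Iterator[Tuple[int, List[int]]]:
        # Each following line lists the outlink targets of one node,
        # starting with node 0; a blank line means no outlinks.
        for node_id, line in enumerate(stream):
            yield node_id, [int(tok) for tok in line.split()]

    return declared_nodes, rows()


def open_graph_txt(path: str) -> TextIO:
    """Open a gzip-compressed graph file as a text stream (hypothetical helper)."""
    return gzip.open(path, "rt", encoding="ascii")


# Tiny in-memory sample (made up): 3 nodes, node 1 has no outlinks.
sample = io.StringIO("3\n1 2\n\n0\n")
count, adjacency = parse_graph_txt(sample)
for node_id, targets in adjacency:
    print(node_id, targets)
```

For the real files, passing `open_graph_txt("ClueWeb09_WG_all.graph-txt.gz")` to `parse_graph_txt` would process the graph one line at a time, which matters given the uncompressed size.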
TREC Category B
The webgraph for the TREC Category B files (first 50 million English pages) consists of the following files:
- ClueWeb09_WG_50m_NodeList_Full.txt.gz 9.7GB: The list of unique URLs that correspond to the nodes in the web graph. This is the full list in node ID order. The first line is the URL for node 0, the next line is the URL for node 1, etc.
- ClueB-ID-DOCNO.txt.tar.gz 1.3GB: The list of Node_ID <=> DOCNO pairs. Each line is an integer node ID and the corresponding WARC DOCNO separated by TAB.
- ClueWeb09_WG_50m.graph-txt.gz 1.6GB: The ASCII webgraph of the outlinks for the nodes. The first line in the file is the total number of nodes (428,136,613 nodes). It is followed by one line of outlinks (target nodes) for each node ID, starting with node 0. Target node IDs are space-separated on the line for that node. If a node has no outlinks, its line is blank. There are more than 50m nodes because the targets of outlinks that point outside the 50m pages are also included as nodes of this graph. However, a node outside the 50m pages has no page content associated with it, so it has no outlinks. For the pages inside the 50m, this way of generating the web graph gives more accurate outlink statistics and better PageRank estimates.
- ClueWeb09_WG_50m_bvgraph.tar.gz 1.3GB: The BVGraph-formatted version of the ASCII webgraph. This version is compatible with the WebGraph Toolkit.
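As an illustration of the Node_ID <=> DOCNO file described above, the sketch below parses TAB-separated lines into a dictionary. The function name `parse_id_docno` is hypothetical and the sample values are illustrative, not taken from the actual file. Because the distributed file is a .tar.gz archive, the lines would first need to be extracted from it; the archive's member name is not stated here, so the sketch accepts any iterable of lines.

```python
from typing import Dict, Iterable


def parse_id_docno(lines: Iterable[str]) -> Dict[int, str]:
    """Map integer node IDs to WARC DOCNO strings from TAB-separated lines."""
    mapping: Dict[int, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines defensively
        node_id, docno = line.split("\t", 1)
        mapping[int(node_id)] = docno
    return mapping


# Illustrative sample lines (values are placeholders, not real file content).
sample_lines = [
    "0\tclueweb09-en0000-00-00000\n",
    "5\tclueweb09-en0000-00-00001\n",
]
id_to_docno = parse_id_docno(sample_lines)
print(id_to_docno[5])
```

Presumably only nodes inside the 50m pages appear in this mapping, so a lookup miss could be used to tell apart nodes that exist only as outlink targets; that is an assumption to check against the data. A plain dictionary over tens of millions of entries is memory-hungry, so a more compact structure may be preferable in practice.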