ClueWeb12 Web Graph
Web Graph Information for the ClueWeb12 dataset.
Data Description
The ClueWeb12 webgraph node ids are organized as follows:
- Node zero is an empty node used to support the BVGraph format.
- The first 733,019,372 node ids (excluding zero) belong to documents that are in the ClueWeb12 corpus.
- The next 245,388,725 node ids belong to documents that were crawled but dropped from the dataset.
- The remaining 5,279,298,498 node ids are outlinks that were not crawled.
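The ranges listed above can be turned into a small lookup. The following is a minimal sketch, assuming each category occupies one contiguous block of ids in the order listed, starting at id 1. The function name and category labels are illustrative and are not part of the dataset. The three counts sum to 6,257,706,595; treating that value as the highest valid node id is also an assumption.

```python
# Counts taken from the node id description above.
CORPUS_DOCS = 733_019_372
DROPPED_DOCS = 245_388_725
UNCRAWLED_OUTLINKS = 5_279_298_498

# Assumption: categories are contiguous and appear in the order listed.
LAST_CORPUS_ID = CORPUS_DOCS
LAST_DROPPED_ID = LAST_CORPUS_ID + DROPPED_DOCS
LAST_NODE_ID = LAST_DROPPED_ID + UNCRAWLED_OUTLINKS


def classify_node(node_id: int) -> str:
    """Return an illustrative category label for a node id."""
    if node_id < 0 or node_id > LAST_NODE_ID:
        raise ValueError(f"node id out of range: {node_id}")
    if node_id == 0:
        return "placeholder"  # empty node that supports the BVGraph format
    if node_id <= LAST_CORPUS_ID:
        return "corpus"       # document present in the ClueWeb12 corpus
    if node_id <= LAST_DROPPED_ID:
        return "dropped"      # crawled, but dropped from the dataset
    return "uncrawled"        # outlink target that was never crawled
```

For example, classify_node(1) returns "corpus" and classify_node(733_019_373) returns "dropped" under these assumptions.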
Url to Node ID:
An ASCII file that maps the URL to the Node Id of the graph. (Node Id zero is not included in this file because it does not have a URL associated with it.) The file format is: [URL] [Node Id]
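Because the uncompressed mapping file is very large, it is usually more practical to stream it than to load it. The sketch below assumes the URL and the node id are separated by whitespace and that the node id is the last field on the line; the exact separator is not specified here. The function names are illustrative.

```python
import bz2
from typing import Iterator, Tuple


def parse_mapping_line(line: str) -> Tuple[str, int]:
    """Split one '[URL] [Node Id]' line into (url, node_id)."""
    # Split from the right so that any whitespace inside the URL is kept.
    url, node_id = line.rstrip("\n").rsplit(None, 1)
    return url, int(node_id)


def iter_url_mapping(path: str) -> Iterator[Tuple[str, int]]:
    """Stream (url, node_id) pairs from the bzip2-compressed mapping file."""
    # errors="replace" is a defensive choice in case a line is not pure ASCII.
    with bz2.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_mapping_line(line)
```

A typical use is to iterate over iter_url_mapping("Clueweb12_url2nodeId.txt.bz2") and keep only the URLs of interest, rather than building a dictionary of the whole file.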
Web Graph in BVGraph format:
This version is compatible with the WebGraph Toolkit. A BVGraph is described by a graph file (extension .graph), an offsets file (extension .offsets), and a properties file (extension .properties).
Web Graph in ASCII format:
The first line in the file is the total number of nodes (6,257,706,595 nodes). Following is one line of outlinks (target nodes) for each node ID, starting with node 0. Any target node IDs will be space-separated on the line for that node. If a node does not have any outlinks associated with it, the line will be blank.
Web Graph Files
The list of unique URLs that correspond to the nodes in the web graph.
- Clueweb12_url2nodeId.txt.bz2: (73G Compressed; approximately 617G uncompressed). ASCII file that maps the URL to the Node Id of the graph.
- Clueweb12_url2nodeId.txt.bz2.md5: (4K). Checksum file for the ASCII file that maps the URL to the Node Id of the graph.
WebGraph Toolkit formatted version of the Web Graph.
- clueweb12.graph: (56G). BVGraph graph file.
- clueweb12.offsets: (3.6G). BVGraph offsets file.
- clueweb12.properties: (4.0K). BVGraph properties file.
- clueweb12.bvgraph.checksums.txt: (4K). An ASCII file containing md5 checksums for the BVGraph files.
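Given the size of these files, it is worth checking each download against its published md5 checksum. The sketch below computes the md5 digest of a file in fixed-size chunks so the file never has to fit in memory. The layout of the checksum files is not described here, so the expected digest is passed in as a plain hex string; the function names are illustrative.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's md5 digest with an expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, matches_checksum("clueweb12.graph", expected) returns True when the local file matches the digest copied from the checksum file.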
ASCII formatted version of the Web Graph.
- ClueWeb12_WebGraph_v2_0.txt.bz2: (84G Compressed; approximately 688G uncompressed). A bzip2-compressed file of the Web Graph in ASCII format.
- ClueWeb12_WebGraph_v2_0.txt.bz2.md5: (4K). File containing checksum for the compressed text file.
- ClueWeb12_WebGraph_v2_0.txt.md5: (4K). File containing checksum for the uncompressed text file.
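The ASCII graph format (a node count on the first line, then one line of space-separated target node IDs per node, starting with node 0) can be read as a stream. The following is a minimal sketch under those stated rules; the function names are illustrative, and stripping commas from the header is a defensive assumption in case the count is written with thousands separators.

```python
import bz2
from typing import Iterator, List, TextIO, Tuple


def read_node_count(handle: TextIO) -> int:
    """Read the first line of the file: the total number of nodes."""
    # Assumption: remove commas in case the count uses thousands separators.
    return int(handle.readline().strip().replace(",", ""))


def iter_outlinks(handle: TextIO) -> Iterator[Tuple[int, List[int]]]:
    """Yield (node_id, targets) for each remaining line, starting at node 0.

    Call read_node_count(handle) first so the header line is consumed.
    """
    for node_id, line in enumerate(handle):
        # A blank line means the node has no outlinks; split() then returns [].
        yield node_id, [int(target) for target in line.split()]


def open_ascii_graph(path: str) -> TextIO:
    """Open the bzip2-compressed ASCII graph for line-by-line text reading."""
    return bz2.open(path, "rt", encoding="ascii")
```

A caller would open the file with open_ascii_graph, call read_node_count once, and then loop over iter_outlinks, processing each node's targets as they arrive instead of holding the whole graph in memory.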
Acknowledgements
The creation of the ClueWeb12 dataset was sponsored by National Science Foundation grant CNS-0934358, under its Community Research Infrastructure program. We thank Sebastiano Vigna and Marc Najork for their significant contributions to the creation of the webgraph. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.