ClueWeb12 Web Graph

Web Graph Information for the ClueWeb12 dataset.

 

Data Description

The ClueWeb12 web graph data is organized as follows:

Url to Node ID:
An ASCII file that maps each URL to its node ID in the graph. (Node ID zero is not included in this file because it has no URL associated with it.) The file format is: [URL] [Node Id]
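
The mapping file can be read one line at a time. The following is a minimal Python sketch, not an official reader. It assumes that the URL and the node ID on each line are separated by whitespace (the separator is not stated above), and the function name parse_url_node_lines is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def parse_url_node_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (url, node_id) pairs from lines of the form '[URL] [Node Id]'.

    Assumption: the two fields are separated by whitespace. The split is
    done on the LAST run of whitespace so the node ID is always the final
    field on the line.
    """
    for line in lines:
        if not line.strip():
            continue  # skip empty lines, if any
        url, node_id = line.rsplit(None, 1)
        yield url, int(node_id)


# Small in-memory example; the URLs are made up for illustration.
sample = [
    "http://a.example/ 1\n",
    "http://b.example/x 2\n",
]
pairs = list(parse_url_node_lines(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so the full mapping never has to be held in memory at once.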

Web Graph in BVGraph format:
This version is compatible with the WebGraph Toolkit. A BVGraph is described by a graph file (with extension .graph), an offset file (with extension .offsets) and a property file (with extension .properties).

Web Graph in ASCII format:
The first line of the file gives the total number of nodes (6,257,706,595 nodes). Each subsequent line lists the outlinks (target node IDs) of one node, in node ID order starting with node 0. Target node IDs are space-separated on the line for that node. If a node has no outlinks, its line is blank.
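
The ASCII layout described above can be streamed without loading the whole graph. The following is a minimal Python sketch, not an official reader. It assumes that the node count on the first line is written as a plain integer without thousands separators, and the function name iter_adjacency is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_adjacency(lines: Iterable[str]) -> Iterator[Tuple[int, List[int]]]:
    """Yield (node_id, outlinks) pairs from the ASCII web graph format.

    The first line holds the total number of nodes. Line k after it holds
    the space-separated target node IDs of node k - 1; a blank line means
    the node has no outlinks.
    """
    it = iter(lines)
    # Assumption: the count is a plain integer such as "6257706595".
    num_nodes = int(next(it).strip())
    for node_id, line in enumerate(it):
        if node_id >= num_nodes:
            break  # ignore anything after the declared number of nodes
        # str.split() with no arguments returns [] for a blank line.
        yield node_id, [int(target) for target in line.split()]


# Tiny in-memory example: 3 nodes, node 1 has no outlinks.
sample = ["3\n", "1 2\n", "\n", "0\n"]
adjacency = list(iter_adjacency(sample))
```

With more than six billion nodes, a generator like this is one way to process the file in a single pass, for example to count outlinks per node, while keeping only one line in memory at a time.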


Web Graph Files

The list of unique URLs that correspond to the nodes in the web graph.

WebGraph Toolkit formatted version of the Web Graph.

ASCII formatted version of the Web Graph.


 

Acknowledgements

The creation of the ClueWeb12 dataset was sponsored by National Science Foundation grant CNS-0934358, under its Community Research Infrastructure program. We thank Sebastiano Vigna and Marc Najork for their significant contributions to the creation of the webgraph. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.