PageRank
The WebGraphs are as provided with the collection. They include as nodes not only the in-collection pages but also all the outlinks from those pages. For example, the Category A English portion has about 500 million (503,860,525) pages, while its graph includes roughly 4.8 billion (4,780,950,903) URLs/nodes.
Note that the PageRank files available on this page contain duplicate entries. In some cases a single DocNO corresponded to multiple NodeIDs in the WebGraph. Because the PageRank scores are calculated per NodeID and the NodeIDs are then mapped back to DocNOs, the same DocNO can appear more than once. To correct for this, sum the PageRank scores of all occurrences of the same DocNO.
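The correction described above can be sketched as follows. This is a minimal illustration, not the official tooling: it assumes each line of the decompressed raw score file holds a DocNO and a score separated by whitespace, which this page does not state, and the function name and sample DocNOs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def sum_duplicate_scores(lines: Iterable[str]) -> Dict[str, float]:
    """Sum PageRank scores over all occurrences of the same DocNO.

    Assumed line format (not confirmed by the page): "<DocNO> <score>".
    """
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        docno, score = parts
        totals[docno] += float(score)
    return dict(totals)


# Hypothetical sample: the first DocNO appears twice.
sample = [
    "clueweb09-en0000-00-00000 0.25",
    "clueweb09-en0000-00-00001 0.15",
    "clueweb09-en0000-00-00000 0.5",
]
merged = sum_duplicate_scores(sample)
```

For the real files, the same function could be fed an open file handle instead of a list, so that lines are processed one at a time.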
Category A English portion
- Raw scores size: 3.2GB, compressed. Scores are not normalized, and the sum is 96176607.1109954
- PageRank prior size: 3.2GB, compressed. This list assigns all pages to 10 bins and assigns a log probability value to each bin of URLs. This list can be fed to the makeprior application of Lemur to build the prior into an existing Indri index.
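Because the raw scores are not normalized, they need to be divided by the stated sum before they can be read as a probability distribution over pages. A minimal sketch, using the Category A sum quoted above; the function names are illustrative and not part of any distributed tool:

```python
import math

# Sum of the raw Category A English scores, as stated on this page.
CATEGORY_A_SUM = 96176607.1109954


def normalize(raw_score: float, total: float = CATEGORY_A_SUM) -> float:
    """Turn a raw PageRank score into a share of the total score mass."""
    return raw_score / total


def log_prob(raw_score: float, total: float = CATEGORY_A_SUM) -> float:
    """Natural log of the normalized score (raw_score must be positive)."""
    return math.log(normalize(raw_score, total))


# Toy scores (hypothetical) normalized against their own total.
toy = [2.0, 3.0, 5.0]
toy_shares = [normalize(s, total=sum(toy)) for s in toy]
```

If duplicate DocNOs are summed first, the total is unchanged, so the same constant still applies.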
About 52% of the DOCNOs in the raw PageRank list have the default minimum PageRank because no inlinks point to them in the WebGraph. About 86% of the bottom DOCNOs in the prior file are in the last of the 10 PageRank prior bins.
Each line of the duplicate record list file contains duplicate DOCNOs that correspond to the same URL. The list also includes DOCNO prefixes, which are in the same format. A DOCNO is a duplicate if it appears in the file as a complete DOCNO or if its prefix appears. As noted above, only the smallest document number is included in the PageRank data. A small number of DOCNOs are neither in the PageRank data nor in the duplicate record list: malformed or incomplete HTML and parser read errors kept these DOCNOs out of the node list, and therefore out of the WebGraph.
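The duplicate test described above (complete DOCNO match or prefix match) might look like the sketch below. It assumes the entries on each line are separated by whitespace and that a prefix is simply a leading substring of a DOCNO; neither detail is confirmed by this page, and the sample entries are hypothetical.

```python
from typing import Iterable, Set


def load_entries(lines: Iterable[str]) -> Set[str]:
    """Collect every entry (complete DOCNO or prefix) from the list file."""
    entries: Set[str] = set()
    for line in lines:
        entries.update(line.split())
    return entries


def is_duplicate(docno: str, entries: Set[str]) -> bool:
    """True if docno is listed in full or starts with a listed prefix."""
    if docno in entries:
        return True  # complete DOCNO match
    return any(docno.startswith(entry) for entry in entries)


# Hypothetical list: one line of complete DOCNOs, one line with a prefix.
entries = load_entries([
    "clueweb09-en0001-02-00010 clueweb09-en0001-02-00011",
    "clueweb09-en0003-55",
])
```

The linear scan over entries is fine for a demonstration; for the full list, a trie or a lookup on each leading segment of the DOCNO would scale better.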
Category B
Since Category B is just a subset of A, PageRank scores for Category B documents can be found as a subset of the Category A scores. The following two files contain PageRank scores computed only on the WebGraph of the Category B set.
- Raw scores size: 909MB, compressed. Scores are not normalized, and the sum is 24146270.9505055
- PageRank prior size: 926MB, compressed. This list assigns all pages to 10 bins and assigns a log probability value to each bin of URLs. This list can be fed to the makeprior application of Lemur to build the prior into an existing Indri index.
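This page does not say how the 10 bins or their log probability values were derived. Purely as an illustration of the idea, the sketch below assumes equal-count bins over pages ranked by score, and takes each bin's value to be the log of the mean normalized score of its pages. Both choices, and all names in the code, are assumptions rather than the method used to build the distributed prior files.

```python
import math
from typing import Dict, Tuple


def bin_log_priors(
    scores: Dict[str, float], n_bins: int = 10
) -> Dict[str, Tuple[int, float]]:
    """Map each DocNO to (bin index, log probability value of that bin).

    Assumptions: equal-count bins by descending score; bin value is the
    log of the mean share of total score mass per page in the bin.
    Scores must be positive, which holds for PageRank.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(scores.values())
    size = math.ceil(len(ranked) / n_bins)
    result: Dict[str, Tuple[int, float]] = {}
    for b in range(n_bins):
        chunk = ranked[b * size:(b + 1) * size]
        if not chunk:
            break  # fewer pages than bins
        mean_share = sum(s for _, s in chunk) / total / len(chunk)
        value = math.log(mean_share)
        for docno, _ in chunk:
            result[docno] = (b, value)
    return result


# Toy input: 20 hypothetical documents with scores 1.0 through 20.0.
toy_scores = {f"d{i}": float(i) for i in range(1, 21)}
priors = bin_log_priors(toy_scores)
```

With real data, where a large share of pages sit at the default minimum score, equal-count bins would put many tied pages in the lower bins, so a different boundary rule may well have been used.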