Duplicate URLs
Overall, about 0.2% of the documents in the English portion of Category A are duplicates. They arise from URLs that were downloaded more than once. The two files below give the details.
Duplicate URLs in WARC documents
http://boston.lti.cs.cmu.edu/clueweb09/pagerank/dupDOCNOlist.txt
- Each line of this list contains the DOCNOs that correspond to the same URL.
- If the entries on a line are DOCNO prefixes rather than full DOCNOs, the two WARC files identified by those prefixes contain the same list of URLs.
- This list is the result of deduplication of the whole set of Category A documents.
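As a rough illustration of how the DOCNO list might be applied, the sketch below reads the duplicate groups and decides which documents to skip. It rests on several assumptions that the list description does not confirm: entries on a line are separated by whitespace, the first entry on each line is kept as the representative, and a prefix entry is a DOCNO with its final hyphen-separated field removed. The function names are hypothetical.

```python
from typing import Iterable, List, Set


def parse_duplicate_groups(lines: Iterable[str]) -> List[List[str]]:
    """Split each line into the entries it lists.

    Assumption: entries on a line are separated by whitespace.
    """
    groups = []
    for line in lines:
        entries = line.split()
        if len(entries) > 1:  # a lone entry has nothing to be a duplicate of
            groups.append(entries)
    return groups


def entries_to_drop(groups: List[List[str]]) -> Set[str]:
    """Collect every entry except the first on each line.

    Assumption: keeping the first entry is an arbitrary but adequate choice.
    """
    drop = set()
    for entries in groups:
        drop.update(entries[1:])
    return drop


def is_dropped(docno: str, drop: Set[str]) -> bool:
    """Return True if the DOCNO, or the prefix of its WARC file, is marked."""
    if docno in drop:
        return True
    # Assumption: the prefix is the DOCNO without its last '-' separated field.
    prefix, sep, _ = docno.rpartition("-")
    return bool(sep) and prefix in drop
```

In use, the lines of dupDOCNOlist.txt would be passed to `parse_duplicate_groups`, and `is_dropped` would be called once per document while scanning the collection. If the real file uses a different delimiter or prefix form, only the `split` and `rpartition` calls need to change.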
Duplicate URLs in the URLs list
http://boston.lti.cs.cmu.edu/clueweb09/pagerank/dupDOCIDlist.txt
- Each line is a list of node IDs that correspond to the same URL.
- This list is the result of deduplication of the URL list of Category A.
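One possible use of the node ID list is to rewrite every duplicate node ID to a single representative before working with data keyed by node ID. The sketch below does this for a list of (source, target) pairs. It assumes, without confirmation from the list description, that node IDs are whitespace-separated integers and that the first ID on each line can serve as the representative. The function names and the pair format are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_representatives(lines: Iterable[str]) -> Dict[int, int]:
    """Map each duplicate node ID to the first node ID on its line.

    Assumption: node IDs are integers separated by whitespace.
    """
    rep = {}
    for line in lines:
        ids = [int(token) for token in line.split()]
        for node in ids[1:]:
            rep[node] = ids[0]
    return rep


def remap_pairs(
    pairs: Iterable[Tuple[int, int]], rep: Dict[int, int]
) -> List[Tuple[int, int]]:
    """Rewrite both ends of each pair and drop pairs that become identical."""
    seen = set()
    result = []
    for src, dst in pairs:
        pair = (rep.get(src, src), rep.get(dst, dst))
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result
```

Node IDs that do not appear in the list pass through unchanged, so the same mapping can be applied to any data that refers to nodes by ID.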