Duplicate URLs

Overall, about 0.2% of the documents in the English portion of Category A are duplicates. They result from URLs that were downloaded more than once. Details can be found in the two files below.

Duplicate URLs in WARC documents

http://boston.lti.cs.cmu.edu/clueweb09/pagerank/dupDOCNOlist.txt

Duplicate URLs in the URLs list

http://boston.lti.cs.cmu.edu/clueweb09/pagerank/dupDOCIDlist.txt
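
The kind of duplication described above, one URL downloaded more than once, can be found by grouping document identifiers by URL. The following is a minimal Python sketch, assuming the input is available as (identifier, URL) pairs. The function name, the sample identifiers and the example URLs are hypothetical, and the sketch does not reflect the actual layout of the two files listed above.

```python
from collections import defaultdict


def find_duplicate_urls(records):
    """Group identifiers by URL and keep only URLs seen more than once.

    `records` is an iterable of (identifier, url) pairs. This input shape
    is an assumption made for illustration, not the dataset's file format.
    """
    ids_by_url = defaultdict(list)
    for doc_id, url in records:
        ids_by_url[url].append(doc_id)
    # A URL counts as duplicated when two or more identifiers share it.
    return {url: ids for url, ids in ids_by_url.items() if len(ids) > 1}


# Hypothetical sample data, not taken from the collection.
records = [
    ("doc-0001", "http://example.com/a"),
    ("doc-0002", "http://example.com/b"),
    ("doc-0003", "http://example.com/a"),
]

duplicates = find_duplicate_urls(records)
for url, ids in duplicates.items():
    print(url, ids)
```

This compares URLs as exact strings. URLs that differ only in case, trailing slashes or query-parameter order would need to be normalized first if they should count as the same page.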