ClueWeb12 Remove Duplicate Records
Download ClueWeb12-RemoveduplicateRecords.tgz (11 GB)
Extract the contents of the archive with the following command (tar lists each file as it is unpacked):
$ tar -zxvf ClueWeb12-RemoveduplicateRecords.tgz
ClueWeb12-RemoveduplicateRecords/
ClueWeb12-RemoveduplicateRecords/software/
ClueWeb12-RemoveduplicateRecords/software/WarcRecord.java
ClueWeb12-RemoveduplicateRecords/software/META-INF/
ClueWeb12-RemoveduplicateRecords/software/META-INF/MANIFEST.MF
ClueWeb12-RemoveduplicateRecords/software/RemoveClueWeb12DuplicateRecords.java
ClueWeb12-RemoveduplicateRecords/software/Makefile
ClueWeb12-RemoveduplicateRecords/software/RemoveClueWeb12DuplicateRecords.class
ClueWeb12-RemoveduplicateRecords/software/WarcRecord$WarcHeader.class
ClueWeb12-RemoveduplicateRecords/software/WarcRecord.class
ClueWeb12-RemoveduplicateRecords/software/RemoveClueWeb12DuplicateRecords.jar
ClueWeb12-RemoveduplicateRecords/ClueWeb12DuplicateRecordsToRemove.tgz
ClueWeb12-RemoveduplicateRecords/checksums.tgz
ClueWeb12-RemoveduplicateRecords/ClueWeb12_DocID_to_URI/
ClueWeb12-RemoveduplicateRecords/ClueWeb12_DocID_to_URI/ClueWeb12_Disk1_DocID_To_URL.txt.bz2
ClueWeb12-RemoveduplicateRecords/ClueWeb12_DocID_to_URI/ClueWeb12_Disk2_DocID_To_URL.txt.bz2
ClueWeb12-RemoveduplicateRecords/ClueWeb12_DocID_to_URI/ClueWeb12_Disk3_DocID_To_URL.txt.bz2
ClueWeb12-RemoveduplicateRecords/ClueWeb12_DocID_to_URI/ClueWeb12_Disk4_DocID_To_URL.txt.bz2
ClueWeb12-RemoveduplicateRecords/recordcounts.tgz
ClueWeb12-RemoveduplicateRecords/README.txt
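
For readers who prefer to script the unpacking step, the following is a minimal Python sketch that does the same job as the tar command above. It then peeks at the first few lines of one of the bz2-compressed DocID-to-URL files without decompressing it to disk. The helper names extract_archive and peek_bz2_lines are hypothetical and are not part of the distributed software. The sketch makes no assumption about the line format of the mapping files; README.txt is the place to check that.

```python
import bz2
import itertools
import os
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its member names.

    Hypothetical helper: roughly equivalent to `tar -zxf archive_path`
    run inside dest_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter when this Python version
        # provides it; older versions do not accept the argument.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return names


def peek_bz2_lines(path, n=5):
    """Return the first n lines of a bz2-compressed text file.

    Hypothetical helper: the file is streamed, so only the start of a
    large file is decompressed.
    """
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in itertools.islice(fh, n)]


# Example usage (paths taken from the listing above; left commented out
# because the archive has to be downloaded first):
#
# extract_archive("ClueWeb12-RemoveduplicateRecords.tgz")
# for line in peek_bz2_lines(
#     "ClueWeb12-RemoveduplicateRecords/ClueWeb12_DocID_to_URI/"
#     "ClueWeb12_Disk1_DocID_To_URL.txt.bz2"
# ):
#     print(line)
```

The nested archives in the listing (ClueWeb12DuplicateRecordsToRemove.tgz, checksums.tgz, recordcounts.tgz) are not unpacked by the outer extraction. The same extract_archive call could be applied to each of them in turn.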