ClueWeb12 Redirects:
Redirect Information for the ClueWeb12 dataset.
Data Description
The redirects file is a plain text file with one redirect per line in the form of:
[Source URL] [Destination URL] [Source IP]
- Total number of redirects: 100,671,078
- The file contains 676,031 redirects that have the same source and destination URL but a different source IP address. These entries arose because the crawler visited a URL multiple times and the URL was served from multiple IP addresses.
- The file contains 504,497 redirects that have the same source URL and IP address but a different destination URL. These entries arose because the crawler visited a URL multiple times and the redirect target changed between visits.
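The line format and the two kinds of repeated entries described above can be illustrated with a minimal Python sketch. It assumes the three fields are separated by whitespace and that URLs contain no unescaped spaces; the names `Redirect`, `parse_redirect`, and `find_repeats` are hypothetical and not part of the dataset distribution.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Redirect(NamedTuple):
    source_url: str
    destination_url: str
    source_ip: str


def parse_redirect(line: str) -> Redirect:
    """Parse one '[Source URL] [Destination URL] [Source IP]' line.

    Assumption: fields are whitespace-separated and contain no spaces.
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return Redirect(*parts)


def find_repeats(
    redirects: Iterable[Redirect],
) -> Tuple[Dict[Tuple[str, str], Set[str]], Dict[Tuple[str, str], Set[str]]]:
    """Group entries to expose the two kinds of repeats.

    Returns two dicts:
      - (source URL, destination URL) -> source IPs, where more than one IP was seen
      - (source URL, source IP) -> destination URLs, where more than one was seen
    """
    ips_by_pair: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    dests_by_source: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for r in redirects:
        ips_by_pair[(r.source_url, r.destination_url)].add(r.source_ip)
        dests_by_source[(r.source_url, r.source_ip)].add(r.destination_url)
    multi_ip = {k: v for k, v in ips_by_pair.items() if len(v) > 1}
    multi_dest = {k: v for k, v in dests_by_source.items() if len(v) > 1}
    return multi_ip, multi_dest
```

This sketch keeps every key in memory, so it is suited to samples of the file rather than all 100,671,078 lines; for the full file, an external sort on the relevant fields is likely the more practical approach.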
Data
The Redirects file for the ClueWeb12 crawl can be downloaded here:
- ClueWeb12_Redirects.txt.bz2 (2.7G compressed; approximately 17G uncompressed)
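Because the uncompressed file is approximately 17G, it may be convenient to read the compressed file as a stream instead of decompressing it first. The following is a minimal sketch using Python's standard `bz2` module; the function name `iter_redirect_lines` is hypothetical, and the UTF-8 decoding with replacement of undecodable bytes is an assumption, since the file's encoding is not stated here.

```python
import bz2
from typing import Iterator


def iter_redirect_lines(path: str) -> Iterator[str]:
    """Yield the non-empty lines of a bz2-compressed redirects file, one at a time.

    Assumption: the text decodes as UTF-8; undecodable bytes are replaced
    rather than raising an error.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line
```

A call such as `sum(1 for _ in iter_redirect_lines("ClueWeb12_Redirects.txt.bz2"))` would count the entries without ever holding the whole file in memory.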