Harvestlinks Utility

The HarvestLinks application extracts all links (and their link text) from a collection of web pages. It can be used to gather anchor text and in-links for HTML and TREC Web data. The harvested links can in turn be added to an index in the form of "inlink" fields, for direct retrieval or for page-rank calculations. The default file class is trecweb. To process WARC files, such as those distributed with ClueWeb09, set the optional class parameter to warc.

The two required parameters for the harvestlinks application are:

corpus: The path to the directory holding the corpus files you're trying to index
output: The path to a directory where the link harvesting output should go

For example, running this from the command line might look like:


  ./harvestlinks -corpus=/path/to/corpus -output=/path/to/output
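
For a WARC collection such as ClueWeb09, the same invocation takes the optional class parameter. The sketch below only assembles and prints the command so it can be checked before running. The paths are placeholders, and the -class=warc spelling assumes the optional parameter uses the same -name=value form as corpus and output.

```shell
#!/bin/sh
# Placeholder locations (assumptions); substitute your own paths.
CORPUS=/path/to/corpus
OUTPUT=/path/to/output

# -class=warc selects the WARC file class instead of the default, trecweb.
# The -name=value form is assumed to match the -corpus/-output example.
CMD="./harvestlinks -corpus=$CORPUS -output=$OUTPUT -class=warc"

# Print the command for review before running it.
echo "$CMD"
```

Once the paths are correct, run the printed command from the directory that contains harvestlinks.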

Once you have gathered your links, you must tell the indexer to index them along with your source data. In your index parameter file, you should add the following to your <corpus> parameter set:


  <inlink>/path/to/output/sorted</inlink>

(where the "sorted" directory is the directory named "sorted" under the output directory for harvestlinks). And also, so that the indexer knows about the inlink fields:


  <field><name>inlink</name></field>

This will allow you to perform retrieval tasks on the anchor text.
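
Put together, an index parameter file with both additions might look like the sketch below, written out with a shell here-document. The <inlink> and <field> entries come from the text above. The surrounding <parameters>, <index>, <path> and <class> elements, the file name index_params.xml, and all paths are assumptions based on the usual Indri parameter file layout, so check them against your own indexer configuration.

```shell
#!/bin/sh
# Hypothetical file name for the index parameter file.
PARAMS=index_params.xml

# The quoted 'EOF' keeps the XML literal (no shell expansion inside).
cat > "$PARAMS" <<'EOF'
<parameters>
  <index>/path/to/index</index>
  <corpus>
    <path>/path/to/corpus</path>
    <class>trecweb</class>
    <inlink>/path/to/output/sorted</inlink>
  </corpus>
  <field><name>inlink</name></field>
</parameters>
EOF

echo "wrote $PARAMS"
```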

Harvestlinks Parameters

corpus: (required) The path to the directory holding the corpus files you're trying to index
class: (optional) The file class of the corpus. One of trecweb (the default) or warc.
output: (required) The path to a directory where the link harvesting output should go
redirect: specifies a redirect file that maps source URLs to target URLs, creating aliases for links. The redirect file is a text file with one entry per line, in the form: [SOURCE_URL] [TARGET_URL]. The source URL is the original URL to be found; the target URL is what is searched for in its place.
mergethreads: specifies the number of threads to use for the file sort and merge operations (default 4; fewer than 8 recommended)
delete: set to false to keep any existing directories in the output directory (default true: existing directories are deleted)
harvest: perform the harvesting step (default true, set to false to skip)
sort: perform the sorting/merge step (default true, set to false to skip)
clean: perform cleaning of temporary files after sort (default true, set to false to skip)
combine: perform the final combination of links (default true, set to false to skip)
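
As an illustration of the redirect and step parameters, the sketch below writes a small redirect file and assembles two harvestlinks commands: one that harvests only, and one that later runs the remaining steps on the existing output. The URLs, file name and paths are made up for the example. The -name=value form for the optional parameters, and the need for -delete=false on the second run, are inferred from the descriptions above rather than confirmed.

```shell
#!/bin/sh
# Hypothetical redirect file: one "[SOURCE_URL] [TARGET_URL]" entry per line.
REDIRECTS=redirects.txt
cat > "$REDIRECTS" <<'EOF'
http://example.com/old-page.html http://example.com/new-page.html
http://example.com/home http://example.com/index.html
EOF

# Required parameters shared by both runs (placeholder paths).
COMMON="-corpus=/path/to/corpus -output=/path/to/output"

# Step 1: harvest links only, applying the redirect aliases.
HARVEST="./harvestlinks $COMMON -redirect=$REDIRECTS -sort=false -clean=false -combine=false"

# Step 2: skip harvesting; sort/merge with 4 threads, then clean and combine.
# -delete=false is assumed necessary so the harvested output is kept.
FINISH="./harvestlinks $COMMON -harvest=false -delete=false -mergethreads=4"

# Print both commands for review before running them.
echo "$HARVEST"
echo "$FINISH"
```

Splitting the run this way is only useful if the harvest step is expensive enough to be worth keeping; for a single pass, the defaults run all four steps in order.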

Generated on Tue Jun 15 11:02:58 2010 for Lemur by doxygen 1.3.4