
Harvestlinks Utility

The HarvestLinks application extracts all links (and their link text) from a collection of web pages. It can be used to gather anchor text and in-links for HTML and TREC Web data. The gathered links can then be added to an index as "inlink" fields, for use in direct retrieval or in page-rank calculations. The default file class is trecweb. To process WARC files, such as those distributed with ClueWeb09, set the optional class parameter.

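The two required parameters for the harvestlinks application are corpus (the path to the collection of web pages) and output (the path to the directory where the harvested links are written).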

For example, running this from the command line might look like:

./harvestlinks -corpus=/path/to/corpus -output=/path/to/output
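
The command above relies on the default trecweb file class. For WARC input such as ClueWeb09, the optional class parameter would be added in the same -name=value style. The following is a sketch only: the paths are placeholders, and the value "warc" is an assumption about the file class name rather than something stated on this page. The script prints the command so it can be reviewed before being run.

```shell
# Placeholder paths; replace with your own corpus and output locations.
CORPUS=/path/to/clueweb09
OUTPUT=/path/to/output

# Assumption: "warc" is the file class name accepted for WARC input.
HARVEST_CMD="./harvestlinks -corpus=$CORPUS -output=$OUTPUT -class=warc"

# Dry run: show the command that would be executed.
echo "$HARVEST_CMD"
```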

Once you have gathered your links, you must tell the indexer to index them along with your source data. In your index parameter file, add the following to your <corpus> parameter set:

<inlink>/path/to/output/sorted</inlink>

Here, "sorted" is the directory of that name under the harvestlinks output directory. So that the indexer knows about the inlink fields, also add:

<field><name>inlink</name></field>

This will allow you to perform retrieval tasks on the anchor text.
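
To show where the two fragments above might sit in a complete file, here is a minimal sketch that writes an index parameter file from the shell. Only the <inlink> and <field> elements come from the text above. The enclosing <parameters> element, the <index>, <path>, and <class> entries, the file name index_params.xml, and all paths are assumptions made for illustration.

```shell
# Placeholder locations; adjust to your setup.
OUTPUT=/path/to/output          # the harvestlinks output directory
PARAM_FILE=index_params.xml     # hypothetical parameter file name

# Write a minimal index parameter file.
# Assumed layout: <parameters>, <index>, <path>, and <class> are
# illustrative; <inlink> and <field> are the entries described above.
cat > "$PARAM_FILE" <<EOF
<parameters>
  <index>/path/to/index</index>
  <corpus>
    <path>/path/to/corpus</path>
    <class>trecweb</class>
    <inlink>$OUTPUT/sorted</inlink>
  </corpus>
  <field><name>inlink</name></field>
</parameters>
EOF
```

The generated file would then be passed to the indexer in place of a hand-written parameter file. Note that <inlink> sits inside <corpus>, while the <field> declaration sits alongside it at the top level of the assumed layout.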

Harvestlinks Parameters


Generated on Tue Jun 15 11:02:58 2010 for Lemur by doxygen 1.3.4