Word Entity Duet: Indexing Documents

The Word Entity Duet project provides an indexing application that parses ClueWeb or Wall Street Journal (WSJ) documents into an Elasticsearch index. Entities tagged as described in Tagging Documents are indexed in each document's entity field. Follow the steps below to build the index.

Indexing Steps

Increase the heap space used by Elasticsearch. In elasticsearch-6.1.2/config/jvm.options, set -Xms and -Xmx to at least 2g (preferably 4g to 16g, if memory allows), for example -Xms4g and -Xmx4g.
Start Elasticsearch.
ClueWeb indexing can optionally filter spam. Spam scores for ClueWeb can be downloaded from the Waterloo Spam Rankings for the ClueWeb09 Dataset page. Filter this spam file so that it contains only the documents you want to be treated as spam.
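The filtering step above could be sketched as follows. This is a minimal illustration, not part of the project: it assumes each line of the downloaded spam file holds a percentile score and a ClueWeb document ID separated by whitespace, and that lower scores mean "more likely spam". The function names and the threshold of 70 are hypothetical choices, and the output keeps the same two-column layout on the assumption that the indexer reads it in that form; check both against the data and the indexer you are using.

```python
from typing import Iterable, Iterator


def filter_spam(lines: Iterable[str], threshold: int = 70) -> Iterator[str]:
    """Yield only the lines whose score is below `threshold`.

    Assumed line format: "<percentile-score> <document-id>".
    Blank or malformed lines are skipped.
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "<score> <doc-id>" line
        score, doc_id = parts
        if score.isdigit() and int(score) < threshold:
            yield f"{score} {doc_id}"


def filter_spam_file(src: str, dst: str, threshold: int = 70) -> None:
    """Stream `src` line by line and write the retained lines to `dst`."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for kept in filter_spam(fin, threshold):
            fout.write(kept + "\n")
```

Calling filter_spam_file("spam-scores.txt", "filtered-spam.txt") (both filenames are placeholders) would produce a file that can then be referenced from the spam.filename property.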
Create a properties file for indexing (for example, index.properties) with the following settings:
- host.name= Values are localhost, the host IP address, or the hostname
- host.port= Value of host port (Elasticsearch defaults to port 9200)
- host.schema= Host schema (default http)
- index.name= Name of the index (it is created if it does not exist; otherwise documents are added to the existing index)
- document.type= Values are ClueWeb or WSJ
- data.directory= The directory where the documents are located
- spam.filename= The filename of the filtered spam file (optional; applies to ClueWeb documents only)
- annotations.directory= The directory where the tagged entity annotations are located
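As an illustration of the settings listed above, the sketch below holds a sample properties file as a string and runs a few sanity checks on it before indexing. Everything concrete in it is an assumption made for the example: the index name, the paths, the helper names parse_properties and check_properties, and the decision to treat every key except host.schema and spam.filename as required. The indexer itself may validate its input differently.

```python
# Sample contents for index.properties; all values are placeholders.
SAMPLE = """\
host.name=localhost
host.port=9200
host.schema=http
index.name=clueweb_entities
document.type=ClueWeb
data.directory=/path/to/documents
spam.filename=/path/to/filtered-spam.txt
annotations.directory=/path/to/annotations
"""

# Assumption: host.schema has a default and spam.filename is optional.
REQUIRED_KEYS = (
    "host.name", "host.port", "index.name",
    "document.type", "data.directory", "annotations.directory",
)


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, ignoring blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_properties(props: dict) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not props.get(key)]
    doc_type = props.get("document.type")
    if doc_type and doc_type not in ("ClueWeb", "WSJ"):
        problems.append("document.type must be ClueWeb or WSJ")
    port = props.get("host.port")
    if port and not port.isdigit():
        problems.append("host.port must be a number")
    if props.get("spam.filename") and doc_type == "WSJ":
        problems.append("spam.filename applies to ClueWeb documents only")
    return problems
```

Running check_properties(parse_properties(open("index.properties").read())) before launching the indexer gives a quick way to catch a misspelled key or an unsupported document type.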
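Start indexing with this command: java -jar -Xmx4G indexer-1.0-jar-with-dependencies.jar index.properties. Here index.properties is the properties file created in the previous step, and -Xmx sets the maximum heap size for the indexer. Use at least 2G of heap space (preferably 4G to 8G).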