To create learning-to-rank features with entity information, documents must be tagged with entities before they are indexed. The Word Entity Duet project used TagMe for document and query tagging, but any entity tagger can be used as long as its output is in a format the indexer can read. To use the Freebase API entity information (entity names, descriptions, and aliases), the tags must be Wikipedia IDs. The tag output for each document has this format:
{
"docno": "[DOCUMENT_ID]",
"tagme": "[WIKIPEDIA_ID_1] [WIKIPEDIA_ID_2]..."
}
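If you use a tagger other than TagMe, its output has to be converted to the record layout above. The following is a minimal sketch of such a conversion, not code from the Word Entity Duet project: the class name `TagRecord`, its methods, and the sample document ID and Wikipedia IDs are all hypothetical. It assumes the indexer expects valid JSON (so the two fields are separated by a comma) and that the entity IDs are joined with single spaces.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: formats one tagged document in the docno/tagme layout.
public class TagRecord {

    /** Escapes backslashes and double quotes so the value stays a valid JSON string. */
    public static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds one record: the document ID plus space-separated Wikipedia IDs. */
    public static String toJson(String docno, List<String> wikipediaIds) {
        String ids = String.join(" ", wikipediaIds);
        return "{\n"
             + "\"docno\": \"" + escape(docno) + "\",\n"
             + "\"tagme\": \"" + escape(ids) + "\"\n"
             + "}";
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(toJson("DOC-0001", Arrays.asList("12", "25")));
    }
}
```

Whether the indexer wants one record per file or several records per file is not stated here, so check the provided samples before relying on this layout.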
If you are interested in using TagMe, the source code can be downloaded from GitHub.
Java samples for tagging ClueWeb documents, Wall Street Journal documents, and a list of queries are provided with the Word Entity Duet project on SourceForge.
To use the sample tagging code, copy the TagMe*.java files to the samples directory in the downloaded TagMe project.
Compile the samples using javac, replacing TagMe[SCRIPT_NAME] with the name of the sample to build (for example, TagMeClueWeb):
javac -cp lib/*:libgg/*:ext_lib/*:bin/:samples/ samples/TagMe[SCRIPT_NAME].java
TagMe runs faster with more memory, so give it as large a Java heap as the machine allows. The -Xmx128G option below sets a 128 GB maximum heap; adjust it to fit the available RAM. Run TagMe with this command:
java -cp lib/*:ext_lib/*:libgg/*:bin/:samples/ -Xmx128G -Dtagme.config=config.full.xml TagMe[SCRIPT_NAME] [DATA_DIRECTORY] [RESULTS_DIRECTORY]
Tagging every ClueWeb document with TagMe would take several months, so the ClueWeb sample can be limited to a chosen subset of documents. We suggest tagging only the top N documents for each query. To do so, create a text file with one document ID per line, then pass that file as the third argument:
java -cp lib/*:ext_lib/*:libgg/*:bin/:samples/ -Xmx128G -Dtagme.config=config.full.xml TagMeClueWeb [DATA_DIRECTORY] [RESULTS_DIRECTORY] [TOP_N_FILES.txt]
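One way to build the document ID file is to take the top N results per query from an initial retrieval run. The sketch below is an illustration under stated assumptions, not part of the provided samples: the class name `TopNDocIds` is hypothetical, and the input is assumed to be a TREC-style run file (`query_id Q0 docno rank score run_tag`, whitespace-separated) that is already sorted by rank within each query.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: collects the top-N document IDs per query from a run file.
public class TopNDocIds {

    /**
     * Returns the first n document IDs listed for each query, in input order
     * and without duplicates. Assumes lines are sorted by rank within a query.
     */
    public static Set<String> select(List<String> runLines, int n) {
        Map<String, Integer> seenPerQuery = new HashMap<>();
        Set<String> docIds = new LinkedHashSet<>();
        for (String line : runLines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3) {
                continue; // skip blank or malformed lines
            }
            // Count how many results have been seen for this query so far.
            int seen = seenPerQuery.merge(cols[0], 1, Integer::sum);
            if (seen <= n) {
                docIds.add(cols[2]); // third column is the document ID
            }
        }
        return docIds;
    }

    public static void main(String[] args) throws IOException {
        // Arguments (all assumed): run file, N, output file.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Set<String> ids = select(lines, Integer.parseInt(args[1]));
        // Writes one document ID per line, the layout the tagging step expects.
        Files.write(Paths.get(args[2]), ids, StandardCharsets.UTF_8);
    }
}
```

Because a document can appear in the top N for several queries, the IDs are kept in a set so that each document is listed, and therefore tagged, only once.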