The Word Entity Duet project can generate feature vectors, which can be used to train models in RankLib. Features can be generated at the document level using different document fields and different scoring algorithms such as BM25, tf-idf, Boolean And, Boolean Or, and Coordinate Match. Additionally, features can be generated from the entity details (retrieved from the Freebase API) that are stored in the entity index. All features are normalized to have values between zero and one.
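To make the shape of the output concrete, here is a minimal Python sketch of scaling raw scores into the range zero to one and writing a row in the RankLib (LETOR-style) text format. The min-max scheme, the function names, and the sample scores are assumptions for illustration; the project does not state which normalization it applies.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a constant column maps to all zeros.

    Assumption: min-max scaling. The project's actual scheme may differ.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ranklib_line(label: int, qid: str, features: List[float], doc_id: str) -> str:
    """Format one row as '<label> qid:<qid> 1:<v1> 2:<v2> ... # <doc_id>'."""
    pairs = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {pairs} # {doc_id}"


# Hypothetical raw BM25 scores for three documents retrieved for query 1.
raw_bm25 = [12.0, 3.0, 7.5]
normalized = min_max_normalize(raw_bm25)

# One feature vector: the normalized BM25 score plus a second made-up feature.
line = ranklib_line(label=1, qid="1", features=[normalized[0], 0.25], doc_id="doc-a")
print(line)
```

Feature IDs in this format start at 1, and the text after `#` is a free-form comment that is commonly used to carry the document ID.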
To generate entity features, queries must be tagged with entity IDs that match the entity IDs tagged in the documents. The Word Entity Duet project used TagMe to generate Wikipedia entity IDs, but any tagger will work as long as it is consistent with the document tags.
Query files must contain one query per line, with the query ID, the query text, and the tagged entities separated by three colons (:::). See the example query below.
1::: Antitrust Cases Pending::: 666256 23004 18963471
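The query line format described above can be read with a few lines of Python. This is a sketch only: the `TaggedQuery` type and `parse_query_line` function are hypothetical names, and it assumes entity IDs are separated by whitespace, as in the example query.

```python
from typing import List, NamedTuple


class TaggedQuery(NamedTuple):
    query_id: str
    text: str
    entity_ids: List[str]


def parse_query_line(line: str) -> TaggedQuery:
    """Split '<id>::: <text>::: <entity ids>' into its three fields."""
    parts = [part.strip() for part in line.rstrip("\n").split(":::")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields separated by ':::', got {len(parts)}")
    query_id, text, entities = parts
    # An empty third field yields an empty list (a query with no entities).
    return TaggedQuery(query_id, text, entities.split())


query = parse_query_line("1::: Antitrust Cases Pending::: 666256 23004 18963471")
print(query)
```

A query whose third field is empty parses to an empty entity list, which corresponds to a query with no tagged entities.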
To use the sample tagging code, copy the TagMeQueries.java file to the samples directory in the downloaded TagMe project.
Compile the samples using javac.
javac -cp lib/*:libgg/*:ext_lib/*:bin/:samples/ samples/TagMeQueries.java
Running TagMe with as much memory as possible is recommended because it makes TagMe run faster. Run TagMe with this command, adjusting the -Xmx value to the memory available on your machine.
java -cp lib/*:ext_lib/*:libgg/*:bin/:samples/ -Xmx128G -Dtagme.config=config.full.xml TagMeQueries [QUERIES_FILE] [TAGGED_QUERIES_RESULTS_FILE]
The Word Entity Duet project uses the Elasticsearch Learning to Rank plugin to define a featureset. Features must be defined using the Elasticsearch Query DSL. Additionally, users can install scoring plugins to generate features using different scoring algorithms. The scoring plugin algorithms and their usage are described on the Scoring Plugins page. The Word Entity Duet featuregenerator application automatically uploads defined featuresets to the Elasticsearch instance, so users only need to create a JSON file containing the featureset itself.
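For readers who want to see what an upload involves, the sketch below builds the HTTP request that the Learning to Rank plugin accepts for creating a featureset (`POST _ltr/_featureset/<name>`). It is an illustration, not the featuregenerator's own code: the function name, the `http://localhost:9200` address, and the tiny one-feature featureset are assumptions. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_featureset_request(
    featureset_doc: dict, es_url: str = "http://localhost:9200"
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a featureset.

    Assumption: Elasticsearch is reachable at es_url and the LTR feature
    store has already been initialized.
    """
    name = featureset_doc["featureset"]["name"]
    body = json.dumps(featureset_doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/_ltr/_featureset/{name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical minimal featureset with a single match feature.
example = {
    "featureset": {
        "name": "docfeaturestest2",
        "features": [
            {
                "name": "body_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"body": "{{keywords}}"}},
            }
        ],
    }
}

request = build_featureset_request(example)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

The plugin's feature store normally has to be initialized once (a `PUT _ltr` request) before featuresets can be created in it.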
Create a JSON file to define document features. Here is an example featureset for document features using different document fields and scoring algorithms.
{
  "featureset": {
    "name": "docfeaturestest2",
    "features": [
      {
        "name": "body_query",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{keywords}}"
          }
        }
      },
      {
        "name": "title_query",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{keywords}}"
          }
        }
      },
      {
        "name": "body_cm",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_booland",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "booland",
                "lang": "boolandscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_boolor",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "boolor",
                "lang": "boolorscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_tfidf",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "tfidf",
                "lang": "tfidfscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}",
                  "idfs": "{{idfs}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_entities",
        "params": [
          "entities"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "entities": "{{entities}}"
          }
        }
      },
      {
        "name": "entities_cm",
        "params": [
          "entities"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "entities": "{{entities}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "entities",
                  "query": "{{entities}}"
                }
              }
            }
          }
        }
      }
    ]
  }
}
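Because the features in the example above follow two repeating shapes (a plain match query, and a function_score query wrapping a scoring-plugin script), a featureset file can also be generated rather than written by hand. The Python sketch below shows one way to do that; the helper names `match_feature` and `script_feature` are hypothetical and not part of the project.

```python
import json


def match_feature(name: str, field: str, param: str = "keywords") -> dict:
    """Feature that scores a field with a plain match query."""
    return {
        "name": name,
        "params": [param],
        "template_language": "mustache",
        "template": {"match": {field: "{{" + param + "}}"}},
    }


def script_feature(
    name: str, field: str, source: str, lang: str, param: str = "keywords"
) -> dict:
    """Feature that scores a field with a scoring-plugin script."""
    placeholder = "{{" + param + "}}"
    return {
        "name": name,
        "params": [param],
        "template_language": "mustache",
        "template": {
            "function_score": {
                "query": {"match": {field: placeholder}},
                "script_score": {
                    "script": {
                        "source": source,
                        "lang": lang,
                        "params": {"field": field, "query": placeholder},
                    }
                },
            }
        },
    }


doc = {
    "featureset": {
        "name": "docfeaturestest2",
        "features": [
            match_feature("body_query", "body"),
            match_feature("title_query", "title"),
            script_feature("body_cm", "body", source="cm", lang="cmscript"),
        ],
    }
}

# Serialize to the JSON text that would be saved as the featureset file.
featureset_json = json.dumps(doc, indent=2)
print(featureset_json)
```

The tf-idf feature in the example takes an additional `idfs` script parameter, so it would need its own helper or an extra argument.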
Entity features are created by taking the tagged entities from the queries and looking up their entity IDs in the entity index to get entity names, descriptions, and aliases. Entity features must be defined in a separate JSON file from the document features because each query may have a different number of entities. For each query, the scores of all its entities are summed; if the query has no entities, the score is set to zero. Here is an example featureset for entity features.
{
  "featureset": {
    "name": "entitydetailstest",
    "features": [
      {
        "name": "titledetails",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{title}}"
          }
        }
      },
      {
        "name": "descriptiondetails",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{description}}"
          }
        }
      },
      {
        "name": "aliasdetails",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{aliases}}"
          }
        }
      },
      {
        "name": "titledetails_title",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{title}}"
          }
        }
      },
      {
        "name": "descriptiondetails_title",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{description}}"
          }
        }
      },
      {
        "name": "aliasdetails_title",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{aliases}}"
          }
        }
      },
      {
        "name": "titledetails_cm",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{title}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{title}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "descriptiondetails_cm",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{description}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{description}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "aliasdetails_cm",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{aliases}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{aliases}}"
                }
              }
            }
          }
        }
      }
    ]
  }
}
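The per-query aggregation rule for entity features (sum the scores over all of a query's entities, or use zero when the query has none) can be sketched in Python as follows. The function name, the input layout, and the sample scores are assumptions for illustration; how the summed values are then normalized is not shown here.

```python
from typing import Dict, List


def aggregate_entity_scores(
    per_entity_scores: List[Dict[str, float]], feature_names: List[str]
) -> Dict[str, float]:
    """Sum each feature over all entities of one query for one document.

    per_entity_scores holds one dict per tagged entity, mapping feature
    name to score. An empty list (no entities) yields zeros throughout.
    """
    totals = {name: 0.0 for name in feature_names}
    for scores in per_entity_scores:
        for name in feature_names:
            totals[name] += scores.get(name, 0.0)
    return totals


feature_names = ["titledetails", "descriptiondetails", "aliasdetails"]

# Hypothetical scores for a query tagged with two entities.
two_entities = [
    {"titledetails": 0.5, "descriptiondetails": 1.0, "aliasdetails": 0.25},
    {"titledetails": 0.25, "descriptiondetails": 0.5},
]
summed = aggregate_entity_scores(two_entities, feature_names)

# A query with no tagged entities gets zero for every entity feature.
empty = aggregate_entity_scores([], feature_names)
print(summed, empty)
```

Starting every total at zero keeps the feature vector the same length for all queries, regardless of how many entities each query has.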