Word Entity Duet: Generating Features

The Word Entity Duet project can generate feature vectors that can be used to train models in RankLib. Features can be generated at the document level using different document fields and different scoring algorithms, such as BM25, tf-idf, Boolean And, Boolean Or, and Coordinate Match. Features can also be generated from the entity details in the entity index, which come from the Freebase API. All features are normalized to values between zero and one.
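
This page does not say which normalization the project applies. Purely as an illustration, min-max scaling is one common way to map raw scores into the range zero to one. The function name `min_max_normalize` and the decision to map a constant column to 0.0 are assumptions made for this sketch, not documented project behavior.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no signal, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([2.0, 4.0, 6.0]))
```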

Tagging Queries

To generate entity features, queries must be tagged with entity IDs that match the entity IDs tagged in the documents. The Word Entity Duet project used TagMe to generate Wikipedia entity IDs, but any tagger will work as long as it is consistent with the document tags.

Query files must contain one query per line, with the query ID, the query text, and the tagged entity IDs separated by three colons (:::). See the example query below.

1::: Antitrust Cases Pending::: 666256 23004 18963471
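
To make the line format concrete, here is a minimal sketch of a parser for it. The names `TaggedQuery` and `parse_tagged_query` are hypothetical and not part of the project. The sketch assumes that entity IDs are separated by whitespace, as in the example line, and that a query with no entities still carries an empty third field.

```python
from typing import List, NamedTuple


class TaggedQuery(NamedTuple):
    query_id: str
    text: str
    entity_ids: List[str]


def parse_tagged_query(line: str) -> TaggedQuery:
    """Split one line of the form 'ID::: query text::: entity entity ...'."""
    parts = line.rstrip("\n").split(":::")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields separated by ':::', got {len(parts)}")
    query_id, text, entities = (part.strip() for part in parts)
    # Assumption: an empty third field means the query has no tagged entities.
    return TaggedQuery(query_id, text, entities.split())


query = parse_tagged_query("1::: Antitrust Cases Pending::: 666256 23004 18963471")
print(query.query_id, query.text, query.entity_ids)
```

Entity IDs are kept as strings so that they can be compared directly with the IDs stored in the document tags.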

To use the sample tagging code, copy the TagMeQueries.java file to the samples directory in the downloaded TagMe project.

Compile the samples using javac.

javac -cp lib/*:libgg/*:ext_lib/*:bin/:samples/ samples/TagMeQueries.java

Give TagMe as much memory as possible so that it runs faster; adjust the -Xmx value to suit your machine. Run TagMe with this command.

java -cp lib/*:ext_lib/*:libgg/*:bin/:samples/ -Xmx128G -Dtagme.config=config.full.xml \
  TagMeQueries [QUERIES_FILE] [TAGGED_QUERIES_RESULTS_FILE]

Defining Features

The Word Entity Duet project uses the Elasticsearch Learning to Rank plugin to define a featureset. Features must be defined using the Elasticsearch Query DSL. Users can also install scoring plugins to generate features with additional scoring algorithms; these algorithms and their use are described on the Scoring Plugins page. The Word Entity Duet featuregenerator application automatically uploads the defined featuresets to the Elasticsearch instance, so users only need to create a JSON file containing the featureset itself.

Document Features

Create a JSON file to define document features. Here is an example featureset for document features that uses different document fields and scoring algorithms.

{
  "featureset": {
    "name": "docfeaturestest2",
    "features": [
      {
        "name": "body_query",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{keywords}}"
          }
        }
      },
      {
        "name": "title_query",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{keywords}}"
          }
        }
      },
      {
        "name": "body_cm",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_booland",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "booland",
                "lang": "boolandscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_boolor",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "boolor",
                "lang": "boolorscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_tfidf",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "tfidf",
                "lang": "tfidfscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}",
                  "idfs": "{{idfs}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_entities",
        "params": [
          "entities"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "entities": "{{entities}}"
          }
        }
      },
      {
        "name": "entities_cm",
        "params": [
          "entities"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "entities": "{{entities}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "entities",
                  "query": "{{entities}}"
                }
              }
            }
          }
        }
      }
    ]
  }
}
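
Most entries in the featureset above differ only in name, field, and script, so the file can also be generated rather than written by hand. The following is a sketch under stated assumptions, not part of the project: the helper names `match_feature`, `script_feature`, and `build_featureset` are hypothetical, and the `lang` value is derived from the naming pattern seen in the example (`cm` becomes `cmscript`). The tf-idf feature is left out because it takes an extra `idfs` parameter.

```python
import json
from typing import Dict, List


def match_feature(name: str, field: str, param: str = "keywords") -> Dict:
    """Feature defined by a plain match query on one field."""
    return {
        "name": name,
        "params": [param],
        "template_language": "mustache",
        "template": {"match": {field: "{{" + param + "}}"}},
    }


def script_feature(name: str, field: str, script: str, param: str = "keywords") -> Dict:
    """function_score feature whose score comes from a scoring plugin script."""
    placeholder = "{{" + param + "}}"
    return {
        "name": name,
        "params": [param],
        "template_language": "mustache",
        "template": {
            "function_score": {
                "query": {"match": {field: placeholder}},
                "script_score": {
                    "script": {
                        "source": script,
                        # Assumption: lang follows the pattern in the example above.
                        "lang": script + "script",
                        "params": {"field": field, "query": placeholder},
                    }
                },
            }
        },
    }


def build_featureset(name: str, features: List[Dict]) -> Dict:
    """Wrap a list of feature definitions in the top-level featureset object."""
    return {"featureset": {"name": name, "features": features}}


featureset = build_featureset(
    "docfeaturestest2",
    [
        match_feature("body_query", "body"),
        match_feature("title_query", "title"),
        script_feature("body_cm", "body", "cm"),
        script_feature("body_booland", "body", "booland"),
        script_feature("body_boolor", "body", "boolor"),
        match_feature("body_entities", "entities", param="entities"),
        script_feature("entities_cm", "entities", "cm", param="entities"),
    ],
)
print(json.dumps(featureset, indent=2))
```

Writing the printed JSON to a file gives a featureset file in the same shape as the hand-written example.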

Entity Features

Entity features are created by looking up the entity IDs tagged in each query in the entity index to retrieve entity names, descriptions, and aliases. Entity features must be defined in a separate JSON file from the document features because each query may have a different number of entities. For each query, the scores of all its entities are summed; if the query has no entities, the scores are set to zero.

{
  "featureset": {
    "name": "entitydetailstest",
    "features": [
      {
        "name": "titledetails",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{title}}"
          }
        }
      },
      {
        "name": "descriptiondetails",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{description}}"
          }
        }
      },
      {
        "name": "aliasdetails",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{aliases}}"
          }
        }
      },
      {
        "name": "titledetails_title",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{title}}"
          }
        }
      },
      {
        "name": "descriptiondetails_title",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{description}}"
          }
        }
      },
      {
        "name": "aliasdetails_title",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{aliases}}"
          }
        }
      },
      {
        "name": "titledetails_cm",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{title}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{title}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "descriptiondetails_cm",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{description}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{description}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "aliasdetails_cm",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{aliases}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{aliases}}"
                }
              }
            }
          }
        }
      }
    ]
  }
}
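
The rule for combining entity scores (sum over all entities in the query, zero when there are none) can be sketched as follows. The function name `sum_entity_scores` and the input layout, one dictionary of feature values per tagged entity, are assumptions made for illustration; the project's internal representation is not described here.

```python
from typing import Dict, List


def sum_entity_scores(per_entity_scores: List[Dict[str, float]],
                      feature_names: List[str]) -> Dict[str, float]:
    """Combine per-entity feature scores for one query-document pair.

    Each dict in per_entity_scores holds the feature values computed for one
    tagged entity. With no entities, every feature is 0.0.
    """
    totals = {name: 0.0 for name in feature_names}
    for scores in per_entity_scores:
        for name in feature_names:
            # Assumption: a feature missing for an entity contributes nothing.
            totals[name] += scores.get(name, 0.0)
    return totals


print(sum_entity_scores([], ["titledetails", "aliasdetails"]))
```

Summing keeps the length of the feature vector fixed regardless of how many entities a query has, which is what lets queries with different entity counts share one featureset.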

Feature Generation Steps

Increase the heap space used by Elasticsearch. In elasticsearch-6.1.2/config/jvm.options, set -Xmx and -Xms to at least 2G (preferably 4G to 16G, if possible).
Start Elasticsearch.
Download or create qrels file for queries.
Create a properties file for feature generation.
- host.name= Host to connect to: localhost, the host IP address, or a hostname
- host.port= Value of host port (Elasticsearch defaults to port 9200)
- host.schema= Host schema (default http)
- index.name= Name of the index that stores the documents you are querying
- entity.index.name= Name of the index with the entity title, description, and aliases information
- queries.filename= Name of the file that contains the tagged queries
- qrels.filename= Name of the file with the relevance judgements
- results.filename= Name of the file where the results will be written
- document.featureset= Name of the file which contains the document features
- entities.featureset= Name of the file which contains the entity features
Start feature generation with this command: java -Xmx4G -jar featuregenerator.jar featuregenerator.properties. Use at least 2G of heap space (preferably 4G to 8G).
Use the results file to train a model using RankLib.
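
As an illustration of the properties file described in the steps above, this sketch renders one from a dictionary. The property keys come from the list above; every value, including the index names and file names, is a placeholder chosen for this example, and the helper name `render_properties` is hypothetical.

```python
from typing import Dict

# Keys are taken from the property list; all values are placeholders.
EXAMPLE_SETTINGS: Dict[str, str] = {
    "host.name": "localhost",
    "host.port": "9200",
    "host.schema": "http",
    "index.name": "documents",                     # hypothetical index name
    "entity.index.name": "entities",               # hypothetical index name
    "queries.filename": "tagged_queries.txt",      # hypothetical file name
    "qrels.filename": "qrels.txt",                 # hypothetical file name
    "results.filename": "features.txt",            # hypothetical file name
    "document.featureset": "docfeatures.json",     # hypothetical file name
    "entities.featureset": "entityfeatures.json",  # hypothetical file name
}


def render_properties(settings: Dict[str, str]) -> str:
    """Render settings as key=value lines, one property per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


print(render_properties(EXAMPLE_SETTINGS), end="")
```

Saving the printed text as featuregenerator.properties gives a file that can be passed to the command shown in the steps, once the placeholder values are replaced with real ones.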