Word Entity Duet: Generating Features

The Word Entity Duet project can generate feature vectors, which can be used to train models in RankLib. Features can be generated at the document level using different document fields and different scoring algorithms such as BM25, tf-idf, Boolean And, Boolean Or, and Coordinate Match. Additionally, features can be generated from the entity details (retrieved from the Freebase API) stored in the entity index. All features are normalized to values between zero and one.

Tagging Queries

In order to generate entity features, queries must be tagged with entity IDs that match the entity IDs tagged in the documents. The Word Entity Duet project used TagMe to generate Wikipedia entity IDs, but any tagger will work as long as it is consistent with the document tags.

Query files must contain one query per line: the query ID, the query text, and the tagged entity IDs, each separated by three colons (:::). See the example query below.
1::: Antitrust Cases Pending::: 666256 23004 18963471
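As an illustration, a tagged query line can be parsed by splitting on the ::: delimiter. This is a minimal sketch; the class and method names below are not part of the project.

```java
public class QueryLineParser {
    // Parses one line of a tagged query file: "id::: text::: entityIds".
    // Returns {queryId, queryText, comma-joined entity IDs}.
    static String[] parse(String line) {
        String[] parts = line.split(":::");
        String queryId = parts[0].trim();
        String queryText = parts[1].trim();
        // Entity IDs are space-separated; a query may have none.
        String[] entityIds = parts.length > 2
                ? parts[2].trim().split("\\s+")
                : new String[0];
        return new String[] { queryId, queryText, String.join(",", entityIds) };
    }
}
```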

To use the sample tagging code, copy the TagMeQueries.java file to the samples directory of the downloaded TagMe project.

Compile the samples using javac.

javac -cp lib/*:libgg/*:ext_lib/*:bin/:samples/ samples/TagMeQueries.java
It is recommended to run TagMe with as much memory as possible to speed up tagging. Run TagMe with this command:
java -cp lib/*:ext_lib/*:libgg/*:bin/:samples/ -Xmx128G -Dtagme.config=config.full.xml 
  TagMeQueries [QUERIES_FILE] [TAGGED_QUERIES_RESULTS_FILE]

Defining Features

The Word Entity Duet project uses the Elasticsearch Learning to Rank plugin to define featuresets. Features must be defined using the Elasticsearch Query DSL. Additionally, users can install scoring plugins to generate features using different scoring algorithms; the scoring plugin algorithms and their use are described on the Scoring Plugins page. The Word Entity Duet featuregenerator application automatically uploads defined featuresets to the Elasticsearch instance, so users only need to create a JSON file containing the featureset; the featuregenerator application takes care of the rest.

Document Features

Create a JSON file to define document features. Here is an example featureset for document features using different document fields and scoring algorithms.

{
  "featureset": {
    "name": "docfeaturestest2",
    "features": [
      {
        "name": "body_query",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{keywords}}"
          }
        }
      },
      {
        "name": "title_query",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{keywords}}"
          }
        }
      },
      {
        "name": "body_cm",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_booland",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "booland",
                "lang": "boolandscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_boolor",
        "params": [
          "keywords"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "boolor",
                "lang": "boolorscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_tfidf",
        "params": [
          "keywords",
          "idfs"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{keywords}}"
              }
            },
            "script_score": {
              "script": {
                "source": "tfidf",
                "lang": "tfidfscript",
                "params": {
                  "field": "body",
                  "query": "{{keywords}}",
                  "idfs": "{{idfs}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "body_entities",
        "params": [
          "entities"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "entities": "{{entities}}"
          }
        }
      },
      {
        "name": "entities_cm",
        "params": [
          "entities"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "entities": "{{entities}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "entities",
                  "query": "{{entities}}"
                }
              }
            }
          }
        }
      }
    ]
  }
}
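For intuition about one of the scripted features above: Coordinate Match counts how many query terms appear in the document field. The sketch below is a simplified standalone version of that idea, not the actual cmscript plugin code, which scores against the Elasticsearch index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CoordinateMatch {
    // Counts how many distinct query terms occur in the field text.
    // Tokenization here is naive whitespace splitting, for illustration only.
    static long score(String fieldText, String query) {
        Set<String> docTerms =
                new HashSet<>(Arrays.asList(fieldText.toLowerCase().split("\\s+")));
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .distinct()
                     .filter(docTerms::contains)
                     .count();
    }
}
```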

Entity Features

Entity features are created by taking the tagged entities from the queries and looking up their entity IDs in the entity index to get entity names, descriptions, and aliases. Entity features must be defined in a separate JSON file from the document features because each query may have a different number of entities. For each query, the scores for all of its entities are summed; if a query has no entities, the feature value is set to zero.

{
  "featureset": {
    "name": "entitydetailstest",
    "features": [
      {
        "name": "titledetails",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{title}}"
          }
        }
      },
      {
        "name": "descriptiondetails",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{description}}"
          }
        }
      },
      {
        "name": "aliasdetails",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "body": "{{aliases}}"
          }
        }
      },
      {
        "name": "titledetails_title",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{title}}"
          }
        }
      },
      {
        "name": "descriptiondetails_title",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{description}}"
          }
        }
      },
      {
        "name": "aliasdetails_title",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{aliases}}"
          }
        }
      },
      {
        "name": "titledetails_cm",
        "params": [
          "title"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{title}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{title}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "descriptiondetails_cm",
        "params": [
          "description"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{description}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{description}}"
                }
              }
            }
          }
        }
      },
      {
        "name": "aliasdetails_cm",
        "params": [
          "aliases"
        ],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "query": {
              "match": {
                "body": "{{aliases}}"
              }
            },
            "script_score": {
              "script": {
                "source": "cm",
                "lang": "cmscript",
                "params": {
                  "field": "body",
                  "query": "{{aliases}}"
                }
              }
            }
          }
        }
      }
    ]
  }
}
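The sum-or-zero aggregation described above can be sketched as follows (the class and method names are illustrative, not the featuregenerator API):

```java
import java.util.List;

public class EntityFeatureAggregator {
    // Sums the per-entity scores for one feature of one query;
    // queries with no tagged entities get a score of zero.
    static double aggregate(List<Double> entityScores) {
        if (entityScores == null || entityScores.isEmpty()) {
            return 0.0;
        }
        return entityScores.stream().mapToDouble(Double::doubleValue).sum();
    }
}
```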

Feature Generation Steps

  1. Increase the heap space used by Elasticsearch. In elasticsearch-6.1.2/config/jvm.options, set -Xmx and -Xms to at least 2g (preferably 4g to 16g, if possible).
  2. Start Elasticsearch.
  3. Download or create qrels file for queries.
  4. Create a properties file for feature generation.
  5. Start feature generation with this command: java -jar -Xmx4G featuregenerator.jar featuregenerator.properties. Use at least 2G of heap space (preferably 4G to 8G).
  6. Use the results file to train a model using RankLib.
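Step 4 requires a properties file. The actual keys are defined by the featuregenerator application, so the names below are purely illustrative placeholders showing the kind of settings such a file typically holds:

```
# Hypothetical featuregenerator.properties -- key names are illustrative only
elasticsearch.host=localhost
elasticsearch.port=9200
queries.file=tagged_queries.txt
qrels.file=qrels.txt
featureset.file=docfeatures.json
results.file=features.txt
```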