Offset annotation support
IndriBuildIndex accepts the annotations parameter, which specifies a file containing offset annotations for the documents in a collection. It is specified as:
<corpus>
<annotations>
/path/to/file
</annotations>
</corpus>
<parserName>OffsetAnnotationAnnotator</parserName>
Offset Annotation File Format
The offset annotation file is a 9-column, tab-delimited file. From left to right, the columns are:
- docno
- external document id corresponding to the document in which the annotation occurs.
- type
- TAG or ATTRIBUTE
- id
- an id number for the annotation; each line should have a unique id >= 1.
- name
- for TAG, the name or type of the annotation; for ATTRIBUTE, the attribute name, or key
- start
- start and length define the annotation's extent. The values should be byte offsets relative to the start of the document.
- length
- the number of bytes the annotation spans; meaningless for an ATTRIBUTE.
- value
- for TAG, an optional INT64 (for numeric values); for ATTRIBUTE, a string that is the attribute's value
- parentid
- for TAG, refers to the id number of another TAG to be considered the parent of this one; this is how hierarchical annotations can be expressed. A TAG that has no parent has parentid = 0. For ATTRIBUTE, refers to the id number of the TAG to which it belongs and from which it inherits its start and length. NOTE: the file must be sorted such that any line that uses a given id in this column comes *after* the line that uses that id in the id column.
- debug
- ignored by the OffsetAnnotator; can contain any information that is beneficial to a human reading the file
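The column rules above can be checked before indexing with a small script. The following is a minimal sketch in Python, not part of Indri: the helper names byte_extent and validate_annotations, the sample rows, and the document id doc1 are all hypothetical. It assumes the document text is UTF-8 encoded and that ids are unique across the whole file (the literal reading of "each line should have a unique id"); adjust if your ids are only unique per document.

```python
COLUMNS = ["docno", "type", "id", "name", "start",
           "length", "value", "parentid", "debug"]


def byte_extent(text, char_start, char_end, encoding="utf-8"):
    """Convert a character span into the (start, length) byte extent.

    Assumption: the indexed document is stored in `encoding`.
    """
    start = len(text[:char_start].encode(encoding))
    length = len(text[char_start:char_end].encode(encoding))
    return start, length


def validate_annotations(lines):
    """Return a list of (line_number, message) problems found in the rows."""
    problems = []
    seen_ids = set()  # assumption: ids are unique across the whole file
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            problems.append((lineno, f"expected 9 columns, got {len(fields)}"))
            continue
        row = dict(zip(COLUMNS, fields))
        if row["type"] not in ("TAG", "ATTRIBUTE"):
            problems.append((lineno, f"unknown type {row['type']!r}"))
        try:
            ann_id = int(row["id"])
            parent = int(row["parentid"])
        except ValueError:
            problems.append((lineno, "id and parentid must be integers"))
            continue
        if ann_id < 1 or ann_id in seen_ids:
            problems.append((lineno, f"id {ann_id} is not a unique value >= 1"))
        if parent == 0:
            # Only a TAG may have no parent.
            if row["type"] == "ATTRIBUTE":
                problems.append((lineno, "ATTRIBUTE must reference a TAG id"))
        elif parent not in seen_ids:
            # The parent's line must appear earlier in the file.
            problems.append((lineno, f"parentid {parent} used before its line"))
        seen_ids.add(ann_id)
    return problems


# Hypothetical sample: a sentence TAG, a nested person TAG, and an
# ATTRIBUTE attached to the person TAG.
rows = [
    "doc1\tTAG\t1\tsentence\t0\t20\t0\t0\tfirst sentence",
    "doc1\tTAG\t2\tperson\t0\t5\t0\t1\tnested in sentence",
    "doc1\tATTRIBUTE\t3\tgender\t0\t0\tfemale\t2\tattribute of person",
]

print(validate_annotations(rows))
print(byte_extent("café au lait", 5, 7))
```

The byte_extent helper is there because start and length are byte offsets: in UTF-8 text containing non-ASCII characters, character positions and byte positions differ, so offsets taken from a character-based tool need converting. The validator only checks the structural rules listed above; it does not check that extents fall inside the document or that a parentid points at a TAG rather than an ATTRIBUTE.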