Parsing in Lemur
1. Overview
This document discusses the parsing utilities provided by the Lemur toolkit. They have been designed with flexibility and extensibility in mind. If the functionality you need is not currently implemented by the toolkit, it should be easy to add it and plug it into the parser framework. The sections on the parser architecture, supporting classes, and TextHandler classes describe the API for developers. The final section describes the parser applications and their options.
2. The Parser Architecture for Lemur
The Lemur parser architecture revolves around one class, TextHandler, that allows for the chaining or pipelining of common parser components. A TextHandler may be a stop-word list, stemmer, indexer, or parser. Information is passed from a source, through TextHandlers that modify information and pass it on, to a destination TextHandler. An example of a source TextHandler would be a parser. A stemmer would modify text and pass the information on to other TextHandlers. A destination TextHandler might write parsed data to a file or build an index.
The TextHandler class enforces chaining through its interface. Functions of the TextHandler class are described below. The next TextHandler in a chain is set using the setTextHandler function. For example, calling the Parser's setTextHandler function with the stop-word list as its argument causes information to be passed from the Parser to the stop-word list. A TextHandler may modify the information it receives before passing it on to the next TextHandler. Although a TextHandler may modify the information it receives, it also passes along the original information. It can also pass a list of Property objects associated with that token. Base implementations of all functions are provided by the TextHandler class; a subclass only needs to override the functions that it needs.
The TextHandler class provides the basis for most of the classes used by Lemur for parsing. The hope is that this class will provide a flexible base for extending parser functionality.
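The chaining described above can be sketched with a small stand-in class. This is a simplified illustration under stated assumptions, not the Lemur header: MiniHandler, Lowercaser, Collector, and runChain are hypothetical names, and the real foundToken also carries the original text and a PropertyList.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the TextHandler idea: each handler may change
// a token and then forwards it to the next handler in the chain.
class MiniHandler {
public:
    virtual ~MiniHandler() = default;
    // Plays the role of setTextHandler: choose the next link in the chain.
    void setTextHandler(MiniHandler* next) { next_ = next; }
    // Base implementation passes the token through unchanged.
    virtual void foundWord(const std::string& word) { forward(word); }
protected:
    void forward(const std::string& word) {
        if (next_ != nullptr) next_->foundWord(word);
    }
private:
    MiniHandler* next_ = nullptr;
};

// A modifying handler: lowercases the token before passing it on.
class Lowercaser : public MiniHandler {
public:
    void foundWord(const std::string& word) override {
        std::string lower = word;
        for (char& c : lower)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        forward(lower);
    }
};

// A destination handler: stores whatever reaches the end of the chain.
class Collector : public MiniHandler {
public:
    void foundWord(const std::string& word) override { words.push_back(word); }
    std::vector<std::string> words;
};

// Builds source -> Lowercaser -> Collector and pushes two tokens through it.
std::vector<std::string> runChain() {
    MiniHandler source;
    Lowercaser lower;
    Collector sink;
    source.setTextHandler(&lower);
    lower.setTextHandler(&sink);
    source.foundWord("Lemur");
    source.foundWord("Toolkit");
    assert(sink.words.size() == 2);
    return sink.words;
}
```

Only the handlers that change or consume tokens override foundWord; the source here relies on the pass-through base implementation, which mirrors the point that subclasses override only what they need.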
3. Supporting classes
The following subsections discuss important members and supporting classes related to the TextHandler.
TextHandler::TokenType
Used with
void foundToken(TextHandler::TokenType type, char * token, char * orig, PropertyList * properties);
TokenType is an enumeration including words, tags, and document boundary markers. You may add to this list of types for your own tools. For example, you may wish to use a parser that identifies sentence boundaries. An appropriate way to pass this information along the TextHandler chain would be to add types for beginning-of-sentence and end-of-sentence boundaries. Here is a list of the current types:
- WORDSTR
Calling foundToken with TextHandler::WORDSTR as the token type is equivalent to the foundWord call of the old TextHandler class.
- BEGINDOC
The BEGINDOC type is reserved for signaling the beginning of a document. The token and orig arguments to foundToken should contain the document number. (This call is equivalent to the old foundDoc function.)
- ENDDOC
The ENDDOC type is used to signal the end of a document. Classes using the TextHandler expect this call; make sure your parsers produce it.
- BEGINTAG
This type has been added for support of XML. It could also be used for HTML or SGML parsers or, more generally, to represent hierarchical structure boundaries.
The token and orig arguments should contain only the type of the tag. If the tag is <h3 align="center">, then token and orig should contain "h3". The properties argument to the foundToken call should contain the align information.
- ENDTAG
This type has also been added for support of XML. The token argument should contain just the type of the tag (e.g. "h3").
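As an illustration of the token types listed above, the following sketch shows the sequence of calls a parser might make for one small document. It is a simplified model: TokenEvent, eventsForSampleDoc, documentsAreClosed, and the BEGINSENT and ENDSENT values are hypothetical, the properties are modeled as a plain map rather than a PropertyList, and the empty token passed with ENDDOC is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical copy of the token types listed above, extended with two
// sentence-boundary types as the text suggests a tool might add.
enum TokenType { WORDSTR, BEGINDOC, ENDDOC, BEGINTAG, ENDTAG, BEGINSENT, ENDSENT };

// Simplified event; the real foundToken takes char* arguments and a PropertyList*.
struct TokenEvent {
    TokenType type;
    std::string token;
    std::string orig;
    std::map<std::string, std::string> properties;
};

// Checks that every BEGINDOC is closed by an ENDDOC, which downstream
// handlers rely on.
bool documentsAreClosed(const std::vector<TokenEvent>& events) {
    int open = 0;
    for (const TokenEvent& e : events) {
        if (e.type == BEGINDOC) ++open;
        if (e.type == ENDDOC) {
            if (open == 0) return false;
            --open;
        }
    }
    return open == 0;
}

// Events a parser might emit for a document numbered D1 whose text is
//   <h3 align="center">Hello</h3>
std::vector<TokenEvent> eventsForSampleDoc() {
    std::vector<TokenEvent> events;
    events.push_back({BEGINDOC, "D1", "D1", {}});                    // document number
    events.push_back({BEGINTAG, "h3", "h3", {{"align", "center"}}}); // tag type only
    events.push_back({WORDSTR, "hello", "Hello", {}});               // modified + original
    events.push_back({ENDTAG, "h3", "h3", {}});
    events.push_back({ENDDOC, "", "", {}});                          // assumed empty token
    assert(documentsAreClosed(events));
    return events;
}
```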
Property
A Property object will generally have a name (so it can be retrieved from a list), a data type, a data size, and the data value. Any data type can be added to a Property through the use of the overloaded setValue() method. However, you have to modify the class if you want your own type to be recognized and not be returned as a Property::UNKNOWN type (when getType() is called). Names and values are copied when set, so the Property has its own memory management.
PropertyList
A PropertyList is a container for properties of tokens. Example properties may be the byte offset of the token in the file, attributes associated with a tag, document properties, and so on. Items in the property list are (name, value) pairs. A PropertyList object is owned by its creator. That is, you should not assume that the properties in it will be the same in subsequent calls to TextHandler::foundToken. The creator is also responsible for freeing the memory associated with the list.
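Because the creator owns the list and may change it between calls, a handler that needs a value later should copy it while handling the token. The sketch below illustrates that pattern with hypothetical stand-ins (MiniProperty, MiniPropertyList, OffsetRecorder, recordTwoTokens) and string-only values; it does not reproduce the Lemur Property or PropertyList interfaces.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a (name, value) pair in a property list.
struct MiniProperty {
    std::string name;
    std::string value;
};

// Hypothetical stand-in for PropertyList: a small container searched by name.
class MiniPropertyList {
public:
    // Overwrites an existing property with the same name, or appends a new one.
    void set(const std::string& name, const std::string& value) {
        for (MiniProperty& p : items_) {
            if (p.name == name) { p.value = value; return; }
        }
        items_.push_back({name, value});
    }
    // Returns nullptr when no property has the given name.
    const MiniProperty* get(const std::string& name) const {
        for (const MiniProperty& p : items_) {
            if (p.name == name) return &p;
        }
        return nullptr;
    }
private:
    std::vector<MiniProperty> items_;
};

// A handler that copies the value it needs instead of keeping the pointer,
// because the creator may overwrite the list before the next call.
class OffsetRecorder {
public:
    void foundToken(const std::string& token, const MiniPropertyList* props) {
        assert(props != nullptr);
        const MiniProperty* offset = props->get("offset");
        if (offset != nullptr) offsets.push_back(token + "@" + offset->value);
    }
    std::vector<std::string> offsets;
};

// The creator reuses one list for every token, as a parser might.
std::vector<std::string> recordTwoTokens() {
    MiniPropertyList props;  // owned, and eventually destroyed, by this function
    OffsetRecorder recorder;
    props.set("offset", "0");
    recorder.foundToken("lemur", &props);
    props.set("offset", "6");
    recorder.foundToken("toolkit", &props);
    return recorder.offsets;
}
```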
TextHandlerManager
This class facilitates the creation of Parser, Stemmer, and Stopper objects. A new TextHandler class only needs to be added to the TextHandlerManager to be usable by all existing applications that use the TextHandlerManager. It accepts the type to create as a parameter, but will check the parameter stack if nothing is specified.
4. TextHandler classes
The following subsections discuss TextHandler classes used by the Lemur applications. The only one of the following classes that does not extend the TextHandler class is the WordSet class.
WordSet
The WordSet class is a simple wrapper to a set. It is useful for stop-word lists or acronym lists. It can load a list from a file. The file format is one word per line. WordSet does NOT remove white space on either side of the word, so be careful when editing these files. The contains function is used to check the presence of a word in the set.
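The whitespace caveat is easy to overlook, so here is a minimal sketch of the behavior described above. MiniWordSet, load, and loadSample are hypothetical names, the list is read from a stream rather than a file name so the example is self-contained, and skipping empty lines is an assumption.

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical stand-in for WordSet: one word per line, stored verbatim.
class MiniWordSet {
public:
    // Reads one entry per line. Whitespace is deliberately NOT trimmed,
    // matching the behavior described above. Empty lines are skipped
    // (an assumption made for this sketch).
    void load(std::istream& in) {
        std::string line;
        while (std::getline(in, line)) {
            if (!line.empty()) words_.insert(line);
        }
    }
    bool contains(const std::string& word) const {
        return words_.count(word) > 0;
    }
private:
    std::set<std::string> words_;
};

// Loads a list in which the second entry has a stray trailing space.
MiniWordSet loadSample() {
    std::istringstream file("the\nof \nand\n");
    MiniWordSet words;
    words.load(file);
    assert(words.contains("the"));
    return words;
}
```

With this sample, a lookup for "of" fails because the stored entry is "of " with the trailing space, which is exactly the kind of mistake the warning is about.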
Parser
The Parser class is a generic interface for the parsers in the toolkit. It assumes subclasses implement a parse function, which takes a filename. The acronym list is a WordSet, and some of the toolkit parsers check uppercase words and recognized acronyms against this list. If the word is in the acronym list, it is left uppercase. Otherwise, the word is converted to lowercase. If you do not wish to support the acronym list when you design your parser, you can simply ignore it.
Both the TrecParser and the WebParser remove contractions and possessives, have a simple acronym recognizer, and convert words to lowercase.
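The acronym rule can be sketched as a small function. This is an illustration only: isUppercaseWord, normalizeCase, and demoAcronyms are hypothetical names, a std::set stands in for the WordSet, and the sketch does not attempt the recognition of variants such as U.S.A. or USA's that the toolkit parsers perform.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// True when the word has at least one letter and no lowercase letters.
bool isUppercaseWord(const std::string& word) {
    bool sawLetter = false;
    for (unsigned char c : word) {
        if (std::islower(c)) return false;
        if (std::isupper(c)) sawLetter = true;
    }
    return sawLetter;
}

// Keeps listed acronyms uppercase and lowercases everything else.
std::string normalizeCase(const std::string& word,
                          const std::set<std::string>& acronyms) {
    if (isUppercaseWord(word) && acronyms.count(word) > 0) return word;
    std::string lower = word;
    for (char& c : lower)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return lower;
}

// Applies the rule with "USA" listed and "NATO" not listed.
std::string demoAcronyms() {
    const std::set<std::string> acronyms = {"USA"};
    assert(acronyms.count("NATO") == 0);
    return normalizeCase("USA", acronyms) + " " +
           normalizeCase("NATO", acronyms) + " " +
           normalizeCase("Parser", acronyms);
}
```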
The parsers assume that there is some SGML-style markup separating documents and specifying the document number. The format for web documents is
<DOC>
<DOCNO> document_number </DOCNO>
document text
</DOC>
and the format for TREC-formatted documents is
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
document text
</TEXT>
</DOC>
These document formats allow the inclusion of multiple documents in the same text file.
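To show how several documents can share one file, the following sketch splits text in the web format into (document number, body) pairs. It is a simplified illustration, not the toolkit's parser: trim and splitDocs are hypothetical helpers, matching is case-sensitive, and for the TREC format the returned body would still contain the TEXT tags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Removes leading and trailing spaces, tabs, and line breaks.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    assert(last >= first);
    return s.substr(first, last - first + 1);
}

// Splits a file into (document number, body) pairs. A document without a
// DOCNO field is skipped; an unterminated document ends the scan.
std::vector<std::pair<std::string, std::string>> splitDocs(const std::string& text) {
    const std::string docOpen = "<DOC>", docClose = "</DOC>";
    const std::string noOpen = "<DOCNO>", noClose = "</DOCNO>";
    std::vector<std::pair<std::string, std::string>> docs;
    std::size_t pos = 0;
    while (true) {
        const std::size_t begin = text.find(docOpen, pos);
        if (begin == std::string::npos) break;
        const std::size_t end = text.find(docClose, begin);
        if (end == std::string::npos) break;
        const std::size_t idBegin = text.find(noOpen, begin);
        const std::size_t idEnd = text.find(noClose, begin);
        if (idBegin != std::string::npos && idEnd != std::string::npos &&
            idBegin < idEnd && idEnd < end) {
            const std::size_t idStart = idBegin + noOpen.size();
            const std::size_t bodyStart = idEnd + noClose.size();
            docs.emplace_back(trim(text.substr(idStart, idEnd - idStart)),
                              trim(text.substr(bodyStart, end - bodyStart)));
        }
        pos = end + docClose.size();
    }
    return docs;
}
```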
TrecParser
The TrecParser provides a simple but effective parser for NIST's TREC document format. It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields.
WebParser
The WebParser behaves very similarly to the TrecParser. It parses HTML documents in the NIST TREC format used for the Web Tracks. The parser removes HTML tags. Text within SCRIPT tags is removed, as is text in HTML comments.
ReutersParser
The ReutersParser extracts the TEXT, HEADLINE, and TITLE fields and removes other tags.
BrillPOSParser
Similar to WebParser in terms of the tags used to separate documents, but it recognizes terms with "/" slashes in them. This is the usual output from a Brill part-of-speech tagger: term/pos. Use in combination with a BrillPOSTokenizer, which tokenizes at the separator and passes the part of speech along as a Property.
IdentifinderParser
Similar to WebParser in terms of the tags used to separate documents, but it extracts named entities from tags output by Identifinder and passes them along as Property objects. Prefixes are added to the tags to indicate the beginning and end of multi-token entities. For example, if "Carnegie Mellon University" was identified as a place, it would be parsed with the following properties:
Carnegie [place] [b_place] Mellon [place] University [place] [e_place]
A single-token entity, like Madonna, would be
Madonna [person] [b_person] [e_person]
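The begin and end prefixes in the two examples above follow a simple rule: every token carries the entity type, the first token also carries the b_ form, and the last token also carries the e_ form. A minimal sketch of that rule, using a hypothetical tagEntity helper that renders the properties as bracketed text rather than Property objects:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns one string per token in the form "token [type]", with "[b_type]"
// added on the first token and "[e_type]" added on the last token.
std::vector<std::string> tagEntity(const std::vector<std::string>& tokens,
                                   const std::string& type) {
    assert(!tokens.empty());
    std::vector<std::string> tagged;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        std::string line = tokens[i] + " [" + type + "]";
        if (i == 0) line += " [b_" + type + "]";                  // entity begins here
        if (i + 1 == tokens.size()) line += " [e_" + type + "]";  // entity ends here
        tagged.push_back(line);
    }
    return tagged;
}
```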
InQueryOpParser
The InQueryOpParser provides parsing for InQuery structured query language queries.
ArabicParser
The ArabicParser provides a simple but effective parser for NIST's TREC document format for Arabic documents encoded in Windows CodePage 1256 encoding (CP1256). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields.
InqArabicParser
The InqArabicParser provides parsing for InQuery structured query language queries in Arabic encoded in Windows CodePage 1256 encoding (CP1256).
ChineseParser
The ChineseParser provides a simple but effective parser for NIST's TREC document format for Chinese documents encoded in GB encoding (GB2312). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. This parser is suitable for parsing segmented (tokenized) documents.
ChineseCharParser
The ChineseCharParser provides a simple but effective parser for NIST's TREC document format for Chinese documents encoded in GB encoding (GB2312). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. This parser is suitable for parsing unsegmented documents, producing one token per Chinese character.
Stemmer
The Stemmer class provides an interface for stemmers. All that is required of a subclass is that it implement the stemWord function. The stemWord function may overwrite the current word, but should return the stem as its return value. Currently, the toolkit provides three subclasses: PorterStemmer, KStemmer, and ArabicStemmer.
PorterStemmer
PorterStemmer uses Porter's official stemmer (in C) to stem words. The PorterStemmer class does not stem words beginning with an uppercase letter. This is to prevent stemming of acronyms or names.
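The stemWord contract and the uppercase guard can be sketched together. This is a hedged illustration: MiniStemmer, ToyPluralStemmer, and stemCopy are hypothetical names, the char* signature is an assumption based on the description of stemWord, and the suffix rule is a toy that has nothing to do with the Porter algorithm.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the Stemmer interface: subclasses implement
// stemWord, which may overwrite the buffer and returns the stem.
class MiniStemmer {
public:
    virtual ~MiniStemmer() = default;
    virtual char* stemWord(char* word) = 0;
};

// Toy rule for illustration only (NOT the Porter algorithm): drop one
// trailing 's' from words longer than three characters. Words that start
// with an uppercase letter are returned untouched, as PorterStemmer does.
class ToyPluralStemmer : public MiniStemmer {
public:
    char* stemWord(char* word) override {
        assert(word != nullptr);
        if (std::isupper(static_cast<unsigned char>(word[0]))) return word;
        const std::size_t len = std::strlen(word);
        if (len > 3 && word[len - 1] == 's') word[len - 1] = '\0';
        return word;
    }
};

// Convenience wrapper so the stemmer can be tried on std::string values.
std::string stemCopy(MiniStemmer& stemmer, const std::string& word) {
    std::string buffer = word;  // stemWord may modify its argument
    return std::string(stemmer.stemWord(&buffer[0]));
}
```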
KStemmer
KStemmer uses Krovetz' stemmer (in C) to stem words. This is a less aggressive stemmer than the Porter stemmer.
ArabicStemmer
ArabicStemmer uses one of Larkey's Arabic stemmers (in C) to stem Arabic words. It provides five different stemming functions:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light stemming
- arabic_light10_stop : light stemming with stopping
Stopper
The Stopper class is a subclass of the WordSet class and the TextHandler class. It replaces words in the stop-word list with a NULL pointer.
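Since a stopped word arrives downstream as a NULL pointer, every later handler has to check for it before using the token. The sketch below illustrates this with hypothetical names (MiniStopper, filter, keepContentWords, demoStopper) and a std::set in place of the WordSet base class.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a Stopper: returns nullptr for stop words
// and the unchanged pointer otherwise.
class MiniStopper {
public:
    explicit MiniStopper(const std::set<std::string>& stopwords)
        : stopwords_(stopwords) {}
    const char* filter(const char* word) const {
        if (word == nullptr) return nullptr;  // already removed upstream
        return stopwords_.count(word) > 0 ? nullptr : word;
    }
private:
    std::set<std::string> stopwords_;
};

// Downstream code has to test for nullptr before using a token.
std::vector<std::string> keepContentWords(const std::vector<const char*>& tokens,
                                          const MiniStopper& stopper) {
    std::vector<std::string> kept;
    for (const char* token : tokens) {
        const char* out = stopper.filter(token);
        if (out != nullptr) kept.push_back(out);
    }
    return kept;
}

// Runs a four-token sentence through a two-word stop list.
std::vector<std::string> demoStopper() {
    const std::set<std::string> stopwords = {"the", "of"};
    const MiniStopper stopper(stopwords);
    assert(stopper.filter("the") == nullptr);
    const std::vector<const char*> tokens = {"the", "architecture", "of", "lemur"};
    return keepContentWords(tokens, stopper);
}
```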
QueryTextHandler
The QueryTextHandler checks to see if a word in the query occurs more often in uppercase than in its original form in an Index. If the uppercase form is more common than the original form, the word is added to the query. This is to handle cases where acronyms are not capitalized in the query.
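The frequency check can be sketched as follows. This is an interpretation under stated assumptions: a std::map of term counts stands in for the Index, the uppercase form is appended next to the original word, and termCount, expandQuery, and demoExpand are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Looks up a term count; unseen terms count as zero.
int termCount(const std::map<std::string, int>& counts, const std::string& term) {
    const auto it = counts.find(term);
    return it == counts.end() ? 0 : it->second;
}

// Returns the query with the uppercase variant appended after any word
// whose uppercase form is more frequent than the form used in the query.
std::vector<std::string> expandQuery(const std::vector<std::string>& query,
                                     const std::map<std::string, int>& counts) {
    std::vector<std::string> expanded;
    for (const std::string& word : query) {
        expanded.push_back(word);
        std::string upper = word;
        for (char& c : upper)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        if (upper != word && termCount(counts, upper) > termCount(counts, word))
            expanded.push_back(upper);
    }
    return expanded;
}

// A collection where "USA" is far more common than "usa".
std::vector<std::string> demoExpand() {
    std::map<std::string, int> counts;
    counts["USA"] = 50;
    counts["usa"] = 2;
    counts["parser"] = 10;
    assert(termCount(counts, "PARSER") == 0);
    const std::vector<std::string> query = {"usa", "parser"};
    return expandQuery(query, counts);
}
```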
WriterTextHandler
The WriterTextHandler class writes information from a TextHandler chain to a file.
WriterInQueryHandler
The WriterInQueryHandler class writes information from a TextHandler chain processing the InQuery structured query language to a file.
KeyfileTextHandler
The KeyfileTextHandler takes information from a TextHandler chain to build a KeyfileIncIndex. Stop-words are not counted in the document length.
5. The Parser Applications
There are some parser applications provided in the toolkit. ParseToFile writes parsed text to a file, ParseQuery parses queries and writes output to a file, and ParseInQueryOp parses InQuery structured query language queries and writes output to a file.
ParseToFile parses documents and writes output in BasicDoc format. The program uses one of the toolkit's Parser classes to parse.
Usage: ParseToFile paramfile datfile1 datfile2 ...
Summary of parameters in paramfile:
- outputFile Name of file to output parsed documents to.
- stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are output to the file.
- acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase if in the acronym list. If no acronym list is specified, acronyms will not be recognized.
- docFormat:
  - "trec" for standard TREC formatted documents
  - "web" for web TREC formatted documents
  - "chinese" for segmented Chinese text (TREC format, GB encoding)
  - "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  - "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
  - "porter" Porter stemmer.
  - "krovetz" Krovetz stemmer.
  - "arabic" Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
  - arabic_stop : arabic_stop
  - arabic_norm2 : table normalization
  - arabic_norm2_stop : table normalization with stopping
  - arabic_light10 : light9 plus ll prefix
  - arabic_light10_stop : light10 and remove stop words
ParseQuery parses queries using one of the toolkit's Parser classes and an Index.
Usage: ParseQuery paramfile datfile1 datfile2 ...
Summary of parameters in paramfile:
- qryOutFile The name of the file to write the parsed queries to.
- index Name of the index (with the extension).
- stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are left in the query.
- acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase as USA if USA is in the acronym list. If no acronym list is specified, acronyms will not be recognized.
- docFormat:
  - "trec" for standard TREC formatted documents
  - "web" for web TREC formatted documents
  - "chinese" for segmented Chinese text (TREC format, GB encoding)
  - "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  - "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
  - "porter" Porter stemmer.
  - "krovetz" Krovetz stemmer.
  - "arabic" Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
  - arabic_stop : arabic_stop
  - arabic_norm2 : table normalization
  - arabic_norm2_stop : table normalization with stopping
  - arabic_light10 : light9 plus ll prefix
  - arabic_light10_stop : light10 and remove stop words
ParseInQueryOp parses queries using the InQueryOpParser class.
Usage: ParseInQueryOp paramfile datfile1 datfile2 ...
The parameters are:
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- docFormat:
  - "trec" for standard TREC formatted documents
  - "web" for web TREC formatted documents
  - "chinese" for segmented Chinese text (TREC format, GB encoding)
  - "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  - "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
  - "porter" Porter stemmer.
  - "krovetz" Krovetz stemmer.
  - "arabic" Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
  - arabic_stop : arabic_stop
  - arabic_norm2 : table normalization
  - arabic_norm2_stop : table normalization with stopping
  - arabic_light10 : light9 plus ll prefix
  - arabic_light10_stop : light10 and remove stop words
- outputFile: name of the output file.