Using and Implementing Sifaka Document Parsers

Sifaka comes with several document parsers, including a plain text file parser and a simplified TREC parser. It is also easy to implement a document parser for additional document types. To use any of these parsers to build a Sifaka index, read the instructions in: Quick start.

Additionally, a sample pre-built Sifaka index of Reuters data (sampleReutersIndex.zip) is available on SourceForge: SourceForge Lemur Project Page

How to use an existing document parser.
- Plain text document parser: Indexes each plain text file as a document with an internalId, externalId, and body. When running sifakaBuildIndex, define documentType=text in index.properties to use this parser.
- Reuters document parser: Indexes each Reuters document with an internalId, externalId, headline, and body. To build an index with the Reuters-21578 Text Categorization Data Set from UCI, download the dataset: reuters21578.tar.gz
- HTML document parser: Indexes each HTML document in a directory with an internalId, externalId, and body.
- WARC document parser: Indexes each HTML document in one or more WARC files with an internalId, externalId, and body. Datasets that contain WARC files include ClueWeb09 and ClueWeb12, as well as downloads from archive sites such as http://commoncrawl.org and https://archive.org/.
- Twitter document parser: Indexes each tweet with an internalId, externalId, and body. The Twitter document parser can parse tweets in the Twitter spritzer format.
- Simplified TREC document parser: Indexes one or more XML files where each document is contained between DOC XML tags. The body is between TEXT tags. The document title is between HEADLINE tags. The document label uses the CLASS tag. When running sifakaBuildIndex, define documentType=trec in index.properties to use this parser.
  
  Sample document:
  <DOC>
  <HEADLINE>SAMPLE HEADLINE</HEADLINE>
  <TEXT>Sample text body.</TEXT>
  <CLASS>sample</CLASS>
  </DOC>
- Wall Street Journal document parser: Indexes one or more XML files where each document is contained between DOC XML tags. When running sifakaBuildIndex, define documentType=wsj in index.properties to use this parser.
  
  Sample document:
  <DOC>
  <DOCNO> WSJ870313-0162 </DOCNO>
  <HL> Southwest Air February Traffic</HL>
  <DD> 03/13/87</DD>
  <SO> WALL STREET JOURNAL (J)</SO>
  <IN> LUV AIRLINES (AIR) </IN>
  <DATELINE> DALLAS </DATELINE>
  <TEXT> Sample Wall Street Journal article. </TEXT>
  </DOC>
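
To make the simplified TREC layout described above more concrete, here is a minimal Java sketch that pulls the HEADLINE, TEXT, and CLASS fields out of each DOC element. It is not Sifaka's parser: the class name SimplifiedTrecSketch, the regex-based approach, and the map keys (headline, body, label) are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: NOT Sifaka's implementation. The class name and the map
// keys are hypothetical; only the tag names come from the format description.
class SimplifiedTrecSketch {
    // Matches one document: everything between <DOC> and </DOC>.
    private static final Pattern DOC = Pattern.compile("<DOC>(.*?)</DOC>", Pattern.DOTALL);

    // Returns the trimmed text between <tag> and </tag>, or "" if the tag is absent.
    static String field(String doc, String tag) {
        Pattern p = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">", Pattern.DOTALL);
        Matcher m = p.matcher(doc);
        return m.find() ? m.group(1).trim() : "";
    }

    // Splits the input into documents and extracts title, body, and label from each.
    static List<Map<String, String>> parse(String input) {
        List<Map<String, String>> docs = new ArrayList<>();
        Matcher m = DOC.matcher(input);
        while (m.find()) {
            String doc = m.group(1);
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("headline", field(doc, "HEADLINE")); // document title
            fields.put("body", field(doc, "TEXT"));         // document body
            fields.put("label", field(doc, "CLASS"));       // document label
            docs.add(fields);
        }
        return docs;
    }

    public static void main(String[] args) {
        String sample = "<DOC> <HEADLINE>SAMPLE HEADLINE</HEADLINE> "
                + "<TEXT>Sample text body.</TEXT> <CLASS>sample</CLASS> </DOC>";
        // prints [{headline=SAMPLE HEADLINE, body=Sample text body., label=sample}]
        System.out.println(parse(sample));
    }
}
```

A regex scan is enough for this flat layout because a file may hold many DOC elements without a single root element, which a strict XML parser would reject. A missing tag simply yields an empty field in this sketch.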
How to implement a new document parser.
1. Create a new class in the org.lemurproject.sifaka.buildindex.lucene.documentparser package that implements DocumentParser.
2. Implement the required methods.
3. Add the new parser class to the docParserMap in the constructor of org.lemurproject.sifaka.buildindex.lucene.factory.DocumentParserFactory with a short but descriptive key value.
4. To use the new parser with sifakaBuildIndex, define documentType=[NEW_KEY_VALUE] in index.properties. The NEW_KEY_VALUE should match the key value that was added to the docParserMap in the DocumentParserFactory.
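
The four steps above can be sketched in a self-contained form. Everything in this example is a stand-in: the nested DocumentParser interface and its two methods are hypothetical (the real Sifaka interface may declare different methods), the factory only mimics the idea of a docParserMap keyed by documentType, and the "line" key and LineDocumentParser class are invented for illustration. The sketch is meant to show the shape of the registration and lookup, not Sifaka's actual code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Properties;
import java.util.function.Function;

class NewParserSketch {
    // Hypothetical stand-in for Sifaka's DocumentParser; method names are assumptions.
    interface DocumentParser {
        boolean hasNextDocument();
        Map<String, String> getNextDocument();
    }

    // Steps 1 and 2: a new parser that treats every non-empty line as one document.
    static class LineDocumentParser implements DocumentParser {
        private final String[] lines;
        private int next = 0;

        LineDocumentParser(String input) {
            this.lines = input.split("\\R"); // split on any line break
        }

        public boolean hasNextDocument() {
            // Skip blank lines so they are never returned as documents.
            while (next < lines.length && lines[next].trim().isEmpty()) {
                next++;
            }
            return next < lines.length;
        }

        public Map<String, String> getNextDocument() {
            if (!hasNextDocument()) {
                throw new NoSuchElementException("no more documents");
            }
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("externalId", "line-" + (next + 1)); // 1-based line number
            doc.put("body", lines[next].trim());
            next++;
            return doc;
        }
    }

    // Step 3: register the parser under a short, descriptive key.
    static class DocumentParserFactory {
        private final Map<String, Function<String, DocumentParser>> docParserMap = new HashMap<>();

        DocumentParserFactory() {
            docParserMap.put("line", LineDocumentParser::new);
        }

        DocumentParser create(String documentType, String input) {
            Function<String, DocumentParser> constructor = docParserMap.get(documentType);
            if (constructor == null) {
                throw new IllegalArgumentException("Unknown documentType: " + documentType);
            }
            return constructor.apply(input);
        }
    }

    public static void main(String[] args) throws IOException {
        // Step 4: documentType in index.properties must match the registered key.
        Properties props = new Properties();
        props.load(new StringReader("documentType=line\n"));

        DocumentParser parser = new DocumentParserFactory()
                .create(props.getProperty("documentType"), "first doc\n\nsecond doc\n");
        while (parser.hasNextDocument()) {
            System.out.println(parser.getNextDocument());
        }
    }
}
```

The point to carry over is the last step: the documentType value read from index.properties is used verbatim as the lookup key, so a typo or a mismatch in case means no parser is found. This sketch fails fast with an exception in that case.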