Word Entity Duet: Document Parsers

The Word Entity Duet indexing application includes document parsers for ClueWeb and Wall Street Journal documents.

ClueWeb Document Parser

The ClueWeb Document Parser can parse and index ClueWeb09 and ClueWeb12 documents. The parser reads gzipped WARC files directly, so there is no need to unzip them before running the indexer. The ClueWeb parser divides documents into the fields described below.
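
As an illustration of reading gzipped WARC input without unzipping it first, the following is a minimal sketch that uses only the Java standard library. It is not the project's parser: the class name GzippedWarcSketch and the helpers readTrecIds and gzip are hypothetical, and the sketch assumes that each record carries a WARC-TREC-ID header line, as ClueWeb records typically do. It scans lines naively, so a body line that happens to start with the same prefix would also match.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzippedWarcSketch {

    // Assumed header name; not taken from the project's source code.
    private static final String ID_HEADER = "WARC-TREC-ID:";

    // Hypothetical helper: decompresses the stream on the fly and collects
    // the value of every line that starts with the assumed ID header.
    public static List<String> readTrecIds(InputStream gzipped) throws IOException {
        List<String> ids = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(ID_HEADER)) {
                    ids.add(line.substring(ID_HEADER.length()).trim());
                }
            }
        }
        return ids;
    }

    // Hypothetical helper used only to build in-memory sample input.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.ISO_8859_1));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up record for demonstration; real input would be a .warc.gz file
        // opened with a FileInputStream.
        String sample = "WARC/0.18\n"
                + "WARC-Type: response\n"
                + "WARC-TREC-ID: clueweb09-en0000-00-00000\n"
                + "\n"
                + "<html>hello</html>\n";
        System.out.println(readTrecIds(new ByteArrayInputStream(gzip(sample))));
    }
}
```

For a file on disk, the same readTrecIds call could be given a java.io.FileInputStream over the .warc.gz file; the GZIPInputStream wrapper is what removes the need for a separate unzip step.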

"src/main/java/org/lemurproject/wordentityduet/indexer/parser/WSJParser.java"

Wall Street Journal Document Parser

The Wall Street Journal Document Parser can parse and index Wall Street Journal articles from 1987 to 1992. The Wall Street Journal parser divides documents into the fields described below.