Word Entity Duet: Document Parsers

The Word Entity Duet indexing application includes document parsers for ClueWeb and Wall Street Journal documents.

ClueWeb Document Parser

The ClueWeb Document Parser parses and indexes ClueWeb09 and ClueWeb12 documents. It reads the gzipped WARC files directly, so there is no need to unzip them before running the indexer. The parser divides each document into the fields described below.

"src/main/java/org/lemurproject/wordentityduet/indexer/parser/WSJParser.java"

internalid: A sequentially generated numeric ID
externalid: The ClueWeb document ID
title: The text within the title tags in the document
heading: The text within all heading tags in the document
url: The URL of the document
body: The entire document including HTML tags
entities: A string of entity IDs separated by spaces
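
The following is a minimal sketch of how gzipped input can be read without unzipping it first and how the fields above might be filled from a single HTML page. It is not the project's parser: the class ClueWebFieldSketch, its methods, the regular expressions, and the placeholder document ID and URL are all assumptions made for illustration. WARC record splitting and entity linking are left out, so the entities field is not populated here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of the Word Entity Duet code base.
public class ClueWebFieldSketch {
    // Simple regex-based extraction; a real parser would likely use an HTML library.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HEADING = Pattern.compile(
            "<h([1-6])[^>]*>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Decompresses a gzip stream on the fly and returns its content as text.
    static String readGzipped(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Helper used only to build gzipped sample data in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Maps one HTML page to the field names listed above (entities omitted).
    static Map<String, String> toFields(int internalId, String externalId,
                                        String url, String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("internalid", String.valueOf(internalId));
        fields.put("externalid", externalId);
        Matcher title = TITLE.matcher(html);
        fields.put("title", title.find() ? title.group(1).trim() : "");
        List<String> headings = new ArrayList<>();
        Matcher heading = HEADING.matcher(html);
        while (heading.find()) {
            headings.add(heading.group(2).trim());
        }
        fields.put("heading", String.join(" ", headings));
        fields.put("url", url);
        fields.put("body", html); // whole page, HTML tags included
        return fields;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><h1>First</h1><p>Text</p><h2>Second</h2></body></html>";
        // Read the gzipped bytes directly, as the parser does with gzipped WARC files.
        String page = readGzipped(new ByteArrayInputStream(gzip(html)));
        // The document ID and URL below are placeholders, not real collection entries.
        toFields(1, "sample-doc-id", "http://example.org/", page)
                .forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Decompressing through a stream keeps the compressed files untouched on disk, which matters for collections the size of ClueWeb.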

Wall Street Journal Document Parser

The Wall Street Journal Document Parser parses and indexes Wall Street Journal articles from 1987 to 1992. The parser divides each document into the fields described below.

internalid: A sequentially generated numeric ID
externalid: The WSJ document ID
title: The text between the HL tags
summary: The text between the LP tags
subject: The text between the IN tags
body: The text between the TEXT tags
all: The entire document including tags
entities: A string of entity IDs separated by spaces
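
As a rough illustration of the tag-to-field mapping above, the sketch below pulls the text between each pair of tags out of one article. It is not the project's parser: the class WsjFieldSketch and its methods are hypothetical, the sample article is invented, and reading the document ID from a DOCNO tag is an assumption based on the usual TREC layout rather than something this document states. Entity linking is left out, so the entities field is not populated.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the Word Entity Duet code base.
public class WsjFieldSketch {

    // Returns the trimmed text between <tag> and </tag>, or "" if the tag is absent.
    static String between(String doc, String tag) {
        Pattern p = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">", Pattern.DOTALL);
        Matcher m = p.matcher(doc);
        return m.find() ? m.group(1).trim() : "";
    }

    // Maps one article to the field names listed above (entities omitted).
    static Map<String, String> toFields(int internalId, String doc) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("internalid", String.valueOf(internalId));
        fields.put("externalid", between(doc, "DOCNO")); // assumption: ID sits in DOCNO
        fields.put("title", between(doc, "HL"));
        fields.put("summary", between(doc, "LP"));
        fields.put("subject", between(doc, "IN"));
        fields.put("body", between(doc, "TEXT"));
        fields.put("all", doc); // whole article, tags included
        return fields;
    }

    public static void main(String[] args) {
        // Invented sample article; the ID is a placeholder, not a real WSJ document.
        String doc = "<DOC>\n<DOCNO> WSJ000000-0000 </DOCNO>\n"
                + "<HL> Sample Headline </HL>\n<IN> SAMPLE </IN>\n"
                + "<LP> Lead paragraph. </LP>\n<TEXT> Body text. </TEXT>\n</DOC>";
        toFields(1, doc).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```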