Word Entity Duet: Document Parsers
The Word Entity Duet indexing application includes document parsers for ClueWeb and Wall Street Journal documents.
ClueWeb Document Parser
The ClueWeb Document Parser can parse and index ClueWeb09 and ClueWeb12 documents. The parser reads the gzipped WARC
files directly, so there is no need to unzip them before running the indexer. The ClueWeb parser divides documents
into the fields described below.
"src/main/java/org/lemurproject/wordentityduet/indexer/parser/WSJParser.java"
- internalid: A numeric, sequentially generated ID
- externalid: The ClueWeb document ID
- title: The text within the title tags in the document
- heading: The text within all heading tags in the document
- url: The URL of the document
- body: The entire document, including HTML tags
- entities: A string of entity IDs separated by spaces
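To illustrate the ideas above, the following is a minimal sketch of reading gzipped content as a stream (so nothing is unzipped on disk) and pulling out the title and heading text. It is not the parser's actual code: the class name `ClueWebFieldSketch`, its methods, and the regex-based tag matching are assumptions made for this example, and WARC record handling is left out entirely.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of the Word Entity Duet code base.
public class ClueWebFieldSketch {

    // Assumed patterns for the title and h1..h6 tags (a real HTML parser would be more robust).
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HEADING =
            Pattern.compile("<h([1-6])[^>]*>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Decompresses a gzip stream on the fly and returns its text content.
    public static String readGzipped(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }

    // Returns the text inside the first title tag, or an empty string if there is none.
    public static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Returns the text of all heading tags, joined by single spaces.
    public static String headings(String html) {
        List<String> parts = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            parts.add(m.group(2).trim());
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) throws IOException {
        // Made-up page content, gzipped in memory to stand in for a compressed file.
        String html = "<html><head><title>Example page</title></head>"
                + "<body><h1>First</h1><p>text</p><h2>Second</h2></body></html>";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }

        String body = readGzipped(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println("title: " + title(body));       // prints "title: Example page"
        System.out.println("heading: " + headings(body));  // prints "heading: First Second"
    }
}
```

With a real file, a `FileInputStream` could be passed to `readGzipped` in place of the in-memory stream used here.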
Wall Street Journal Document Parser
The Wall Street Journal Document Parser can parse and index Wall Street Journal articles from 1987 to 1992.
The Wall Street Journal parser divides documents into the fields described below.
- internalid: A numeric, sequentially generated ID
- externalid: The WSJ document ID
- title: The text between the HL tags
- summary: The text between the LP tags
- subject: The text between the IN tags
- body: The text between the TEXT tags
- all: The entire document including tags
- entities: A string of entity IDs separated by spaces
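The tag-to-field mapping above can be sketched as follows. This is only an illustration under stated assumptions: the class name `WsjFieldSketch`, the `between` and `fields` helpers, and the regex matching are hypothetical and not taken from the parser's source. The internalid, externalid, and entities fields are omitted because the description above does not say how they are derived.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the Word Entity Duet code base.
public class WsjFieldSketch {

    // Returns the text between every <tag>...</tag> pair, joined by single spaces.
    public static String between(String doc, String tag) {
        Pattern p = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">", Pattern.DOTALL);
        Matcher m = p.matcher(doc);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.group(1).trim());
        }
        return sb.toString();
    }

    // Maps one raw document to the tag-based fields listed above.
    public static Map<String, String> fields(String doc) {
        Map<String, String> f = new LinkedHashMap<>();
        f.put("title", between(doc, "HL"));
        f.put("summary", between(doc, "LP"));
        f.put("subject", between(doc, "IN"));
        f.put("body", between(doc, "TEXT"));
        f.put("all", doc); // the entire document, tags included
        return f;
    }

    public static void main(String[] args) {
        // Made-up document content, used only to exercise the sketch.
        String doc = "<DOC>\n<HL> Sample headline </HL>\n<IN> TECH </IN>\n"
                + "<LP> Lead paragraph. </LP>\n<TEXT> Full text. </TEXT>\n</DOC>";
        fields(doc).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Joining repeated tags with a space is a choice made for this sketch; the actual parser may treat repeated tags differently.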