Quick Start

Getting started with Sifaka takes four steps.

  1. Java 8: Sifaka requires the 64-bit version of Java 8. If you don't have it already, download the Java 8 Runtime Environment (JRE).
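To see whether a suitable Java is already installed, you can inspect the output of java -version. The sketch below is one possible check for a POSIX shell (Terminal on Mac or Linux, bash on Windows 10). The function name looks_like_java8_64bit is hypothetical and not part of Sifaka. It assumes the usual Oracle/OpenJDK output, where Java 8 reports a version beginning with 1.8 and 64-bit builds mention "64-Bit".

```shell
# Hypothetical helper (not part of Sifaka).
# Succeeds if the given "java -version" output looks like 64-bit Java 8.
looks_like_java8_64bit() {
  output="$1"
  # Java 8 reports itself as version "1.8.x".
  printf '%s\n' "$output" | grep -q 'version "1\.8\.' || return 1
  # 64-bit HotSpot/OpenJDK builds include "64-Bit" in the VM line.
  printf '%s\n' "$output" | grep -q '64-Bit' || return 1
  return 0
}
```

Because java -version writes to standard error, a typical call would be: looks_like_java8_64bit "$(java -version 2>&1)" && echo "Java 8 (64-bit) found"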

  2. Download Sifaka: It is available from SourceForge, on the Lemur Project page.

  3. Index documents: Sifaka uses a document index that enables it to search and analyze your documents quickly. Thus, the next step is to build an index for your documents.

    If you want to skip this step: The Lemur Project provides a sample index for a subset of the classic Reuters-21578 dataset. Look on SourceForge for sampleReutersIndex.zip on the Lemur Project page.

    To build a document index: Sifaka can index documents in plain text and simplified TREC format. (The Document Parser Tutorial provides information about how to build parsers for other document formats.) The first step is to create a properties file that describes your documents. The second step is to run SifakaBuildIndex. When running SifakaBuildIndex, make sure that the classifiers and models directories are in the same directory as sifakaBuildIndex.jar.

    1. Create a properties file (index.properties) in the same directory as sifakaBuildIndex.jar.

      # Specifies the document parser to use
      # Options: [text, trec, wsj]
      # text = Plain text files where each file is a document
      # trec = Simplified TREC format with DOCNO, HEADLINE, TEXT, and CLASS xml tags
      # wsj = TREC Wall Street Journal format
      documentType=text

      # data options
      dataDirectory=PATH_TO_DATA
      # Index directory is created by Sifaka
      indexDirectory=PATH_TO_INDEX

      # analyzer options
      # kstem = Krovetz stemmer
      stemmer=kstem
      removeStopwords=true
      ignoreCase=true

      # CSV list of annotations
      # Options: [ne, noun-phrase, bigram, trigram]
      # ne = named entities (person, location, and organization)
      annotation.types=ne,bigram,trigram,noun-phrase
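Before building the index, it can help to confirm that index.properties contains the settings shown in the sample above. The following is a minimal sketch for a POSIX shell. The function name missing_keys is hypothetical, and the list of keys is simply the set that appears in the sample; whether Sifaka treats every one of them as mandatory is an assumption.

```shell
# Hypothetical helper (not part of Sifaka).
# Prints each key from the sample properties file that has no
# uncommented "key=value" line in the file given as $1.
missing_keys() {
  file="$1"
  for key in documentType dataDirectory indexDirectory stemmer \
             removeStopwords ignoreCase annotation.types; do
    # A key counts as present only if the line is not commented out.
    grep -q "^[[:space:]]*$key[[:space:]]*=" "$file" || echo "$key"
  done
}
```

Running missing_keys index.properties prints nothing when all of the sample keys are present, and otherwise prints one missing key per line.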

    2. Run Sifaka Build Index: The steps for starting Sifaka Build Index differ slightly between operating systems.

      Windows: There are two options for running Sifaka Build Index on Windows.

      1. Windows 7: Open a command prompt. Navigate to the directory that contains sifakaBuildIndex.jar. Type: java -jar sifakaBuildIndex.jar index.properties

      2. Windows 10: Open a bash shell. Navigate to the directory that contains sifakaBuildIndex.jar. Type: java -jar sifakaBuildIndex.jar index.properties

      Mac: Open a terminal. Navigate to the directory that contains sifakaBuildIndex.jar. Type: java -jar sifakaBuildIndex.jar index.properties

      Linux: Open a terminal. Navigate to the directory that contains sifakaBuildIndex.jar. Type: java -jar sifakaBuildIndex.jar index.properties
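The build step expects several items to sit side by side: sifakaBuildIndex.jar, index.properties, and the classifiers and models directories. The sketch below shows one way to check this in a POSIX shell before starting the build. The function name missing_build_files is hypothetical, and it assumes the two directories are literally named classifiers and models.

```shell
# Hypothetical helper (not part of Sifaka).
# Prints every item the build step expects that is absent from directory $1.
missing_build_files() {
  dir="$1"
  [ -f "$dir/sifakaBuildIndex.jar" ] || echo "sifakaBuildIndex.jar"
  [ -f "$dir/index.properties" ] || echo "index.properties"
  # Directory names are assumed from the instructions above.
  [ -d "$dir/classifiers" ] || echo "classifiers/"
  [ -d "$dir/models" ] || echo "models/"
}
```

From the Sifaka directory, run missing_build_files . and start the build with java -jar sifakaBuildIndex.jar index.properties only if it prints nothing.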

  4. Analyze documents: The steps for starting Sifaka differ slightly between operating systems.

    Windows:

    1. All versions: Double-click on sifakaTextMiner.jar. This is the best choice for most people.

    2. Windows 7: Open a command prompt. Navigate to the directory that contains sifakaTextMiner.jar. Type: java -jar sifakaTextMiner.jar

    3. Windows 10: Open a bash shell. Navigate to the directory that contains sifakaTextMiner.jar. Type: java -jar sifakaTextMiner.jar

    Mac: Open a terminal. Navigate to the directory that contains sifakaTextMiner.jar. Type: java -jar sifakaTextMiner.jar

    Linux: Open a terminal. Navigate to the directory that contains sifakaTextMiner.jar. Type: java -jar sifakaTextMiner.jar