Sifaka Feature Vectors

The Feature Vectors tab finds the features most highly correlated with each document label and creates feature vectors that machine learning software can use for classification. The Feature Vectors tab opens only for labeled datasets.

Overview of Feature Vectors input

Example

  1. Select an index from the Indexes pane, for example: reuters.

  2. Select the Feature Vectors tab in the right content tab pane.

  3. Select the Label Type, for example: class.

  4. Select the feature types that you want, for example: term and noun-phrase.

  5. Enter a Minimum Frequency for feature values, for example: 5. Features with smaller values will not be included in the feature vector.

  6. Press the View Features button. This experiment may take several minutes. A progress indicator is displayed while Sifaka calculates the feature scores. Note: If the experiment takes longer than 15 minutes to run, check which Java version is installed.

    Example: Feature Vectors screen

  7. The feature scores for each class, sorted by descending kappa value, are displayed to the right of the input.

  8. Click on category tabs to see which features have the highest kappa scores for each category; in this example, the acq, coffee, earn, gold, heat, housing, and neg tabs are shown.

  9. You can Select All Features for export, or you can enter the Number of Features to Select for each category, for example: 50.

  10. Click Save Results to export the feature vectors to an ARFF file that can be used by WEKA.
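
To make the steps above more concrete, the following Python sketch imitates the workflow on a tiny invented corpus: score each feature against a class, keep the top-scoring features, and write them out as ARFF text. It is an illustration only, not Sifaka's implementation. It assumes that the kappa score is Cohen's kappa between "the document contains the feature" and "the document has the label", and that Minimum Frequency means the number of documents containing the feature; Sifaka's exact definitions are not stated here. The function names (cohen_kappa, top_features, to_arff), the sample documents, and the relation name are all hypothetical.

```python
from collections import Counter


def cohen_kappa(has_feature, in_class):
    """Cohen's kappa between two parallel lists of booleans."""
    n = len(has_feature)
    both = sum(1 for f, c in zip(has_feature, in_class) if f and c)
    neither = sum(1 for f, c in zip(has_feature, in_class) if not f and not c)
    observed = (both + neither) / n          # observed agreement
    p_f = sum(has_feature) / n               # share of docs with the feature
    p_c = sum(in_class) / n                  # share of docs in the class
    expected = p_f * p_c + (1 - p_f) * (1 - p_c)  # agreement by chance
    if expected == 1.0:                      # degenerate case, kappa undefined
        return 0.0
    return (observed - expected) / (1 - expected)


def top_features(docs, labels, target, min_freq=1, top_n=50):
    """Return up to top_n (feature, kappa) pairs for one class.

    Assumption: min_freq is a document-frequency threshold.
    """
    doc_sets = [set(d) for d in docs]
    doc_freq = Counter(t for s in doc_sets for t in s)
    in_class = [lab == target for lab in labels]
    scored = []
    for term, freq in doc_freq.items():
        if freq < min_freq:                  # drop rare features
            continue
        has = [term in s for s in doc_sets]
        scored.append((cohen_kappa(has, in_class), term))
    # Sort by descending kappa, then alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(term, kappa) for kappa, term in scored[:top_n]]


def to_arff(relation, features, docs, labels):
    """Build ARFF text with one numeric count attribute per feature.

    Simplification: feature names containing quotes are not escaped, and a
    feature literally named 'class' would clash with the label attribute.
    """
    classes = sorted(set(labels))
    lines = ["@relation " + relation, ""]
    for f in features:
        lines.append("@attribute '%s' numeric" % f)
    lines.append("@attribute class {%s}" % ",".join(classes))
    lines += ["", "@data"]
    for doc, lab in zip(docs, labels):
        counts = Counter(doc)
        lines.append(",".join([str(counts[f]) for f in features] + [lab]))
    return "\n".join(lines) + "\n"


# Invented four-document corpus with two labels.
docs = [
    ["gold", "mine", "ounce"],
    ["gold", "price", "ounce"],
    ["coffee", "export", "price"],
    ["coffee", "crop", "export"],
]
labels = ["gold", "gold", "coffee", "coffee"]

selected = top_features(docs, labels, "gold", min_freq=2, top_n=2)
print(selected)  # [('gold', 1.0), ('ounce', 1.0)]
print(to_arff("reuters_sample", [term for term, _ in selected], docs, labels))
```

In this toy corpus, "gold" and "ounce" appear in exactly the documents labeled gold, so both reach a kappa of 1.0, while "price" appears once in each class and scores 0. The ARFF text follows the usual layout of a header (@relation and @attribute lines) followed by an @data section with one comma-separated row per document; the file that Sifaka writes may use different attribute names or value types.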