Sifaka Feature Vectors

The Feature Vector tab is for finding features most highly correlated with each document label and creating feature vectors which can be used by machine learning software for classification. Users can select categories from either document labels or saved sets of documents for creating feature vectors.

Overview of Feature Vectors input

Example

  1. Select an index from the Indexes pane, for example: reuters.

  2. Select the Feature Vectors tab in the right content tab pane.

  3. Select the Label Type, for example: topics.

  4. Select the Feature types that you want, for example: term and noun-phrase.

  5. Enter a Minimum Frequency for feature values, for example: 10. Features whose frequency is below this minimum are not included in the feature vector.

  6. Select the Negative Documents. Choose random to use a random sampling of negative documents proportional to the size of the label categories selected.

  7. Select the Label Categories for calculating kappa values. The label categories table is sortable by label value or count by clicking on the column header. In this example, the five label categories with the highest counts (earn, acq, money-fx, grain, crude) are selected.

  8. Press the View Features button. This experiment may take several minutes. A progress indicator is displayed while Sifaka calculates the feature scores. Note: If the experiment takes longer than 15 minutes to run, check which Java version is installed.

  9. When the kappa values have been calculated, a table appears on the right with one row per category, showing the kappa scores of the first, tenth, fiftieth, and one hundredth highest ranked features.

    Feature Vector Kappa Calculations displayed

  10. At the end of each row there is a View button. Click on that button to view all the features and their kappa scores for each category.

    Features and scores for selected category

  11. When exporting to feature vectors, all features are selected by default. To limit the number of exported features, enter feature selection criteria. Check Top number of features to select and enter a number to export that many of the highest ranked features from each category. Check Feature score above threshold and enter the kappa value that a feature must exceed to be exported. If both criteria are checked, choose AND to select features that satisfy both criteria, or choose OR to select features that satisfy at least one of them.

    Feature Selection Criteria

  12. Select whether to export Feature Weights as Binary, TF (term frequency), or TF-IDF (term frequency * inverse document frequency).

  13. Click Save Results to export feature vectors to an ARFF file that can be used by WEKA.
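
The scoring, selection, weighting, and export steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sifaka's implementation: the guide does not give the kappa formula, so Cohen's kappa between "document contains the feature" and "document belongs to the category" is assumed here; the IDF variant log(N / df) is one common choice among several; and all function names (cohen_kappa, select_features, feature_weight, to_arff) are hypothetical. Only the ARFF keywords (@RELATION, @ATTRIBUTE, @DATA) follow the WEKA file format.

```python
import math

def cohen_kappa(has_feature, in_category):
    """Cohen's kappa between two parallel lists of booleans (assumed scoring)."""
    n = len(has_feature)
    observed = sum(f == c for f, c in zip(has_feature, in_category)) / n
    p_f = sum(has_feature) / n
    p_c = sum(in_category) / n
    expected = p_f * p_c + (1 - p_f) * (1 - p_c)  # agreement expected by chance
    if expected == 1.0:
        return 0.0  # degenerate case: no variation to agree on
    return (observed - expected) / (1 - expected)

def select_features(scores, top_n=None, threshold=None, combine="AND"):
    """scores maps feature -> kappa. Returns kept features, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    in_top = set(ranked[:top_n]) if top_n is not None else None
    above = ({f for f in ranked if scores[f] > threshold}
             if threshold is not None else None)
    if in_top is None and above is None:
        return ranked  # no criteria: all features are selected by default
    if in_top is None:
        keep = above
    elif above is None:
        keep = in_top
    elif combine == "AND":
        keep = in_top & above  # must satisfy both criteria
    else:
        keep = in_top | above  # must satisfy at least one criterion
    return [f for f in ranked if f in keep]

def feature_weight(tf, df, n_docs, scheme="TF-IDF"):
    """Weight of one feature in one document under the chosen scheme."""
    if scheme == "Binary":
        return 1.0 if tf > 0 else 0.0
    if scheme == "TF":
        return float(tf)
    return tf * math.log(n_docs / df)  # assumed IDF variant: log(N / df)

def to_arff(relation, features, rows, categories):
    """rows is a list of (weights, category) pairs. Returns ARFF text.
    Feature names are quoted; names containing quotes are not handled."""
    lines = [f"@RELATION {relation}", ""]
    lines += [f'@ATTRIBUTE "{f}" NUMERIC' for f in features]
    lines.append("@ATTRIBUTE class {" + ",".join(categories) + "}")
    lines += ["", "@DATA"]
    for weights, category in rows:
        lines.append(",".join(f"{w:g}" for w in weights) + "," + category)
    return "\n".join(lines) + "\n"
```

For example, select_features(scores, top_n=100, threshold=0.1, combine="OR") would keep any feature that is either among the 100 highest ranked or has a kappa above 0.1, which mirrors the OR option in step 11.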