Online Clustering

This application performs basic online clustering. In conjunction with an incremental indexer (such as KeyfileIncIndex), it could be used for the TDT topic detection task. It iterates over the documents in the index and assigns each document that is not yet in any cluster to a cluster. The document id, cluster id, and score are printed to standard output.
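
The loop described above can be illustrated with a small, self-contained Python sketch. It is not the Lemur implementation: the names online_cluster and cosine, the dictionary-based term vectors, and the choice to report the best score seen against existing clusters (0.0 when there are none) for a document that starts a new cluster are all assumptions made for illustration. Clusters are represented by the sum of their member vectors, which gives the same cosine score as the mean vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def online_cluster(docs, threshold=0.25):
    """Assign each (doc_id, vector) pair to a cluster in a single pass.

    Hypothetical sketch: each cluster is kept as the sum of its member
    vectors. Cosine is scale invariant, so scoring against the sum is
    equivalent to scoring against the centroid (mean).
    """
    clusters = []  # clusters[i] is the summed term vector of cluster i
    results = []   # (doc_id, cluster_id, score) triples
    for doc_id, vec in docs:
        best_id, best_score = None, 0.0
        for cid, summed in enumerate(clusters):
            score = cosine(vec, summed)
            if best_id is None or score > best_score:
                best_id, best_score = cid, score
        if best_id is None or best_score < threshold:
            # No existing cluster is close enough: start a new one.
            clusters.append(dict(vec))
            best_id = len(clusters) - 1
        else:
            # Fold the document into the best matching cluster.
            target = clusters[best_id]
            for term, weight in vec.items():
                target[term] = target.get(term, 0.0) + weight
        results.append((doc_id, best_id, best_score))
    return results


if __name__ == "__main__":
    sample = [
        ("d1", {"apple": 1.0, "pie": 1.0}),
        ("d2", {"apple": 1.0, "pie": 1.0, "recipe": 1.0}),
        ("d3", {"stock": 1.0, "market": 1.0}),
    ]
    for doc_id, cluster_id, score in online_cluster(sample):
        print(doc_id, cluster_id, round(score, 4))
```

Because documents are processed in index order and never reassigned, the result depends on the order of arrival, which is what makes this style of clustering suitable for use alongside an incremental indexer.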

The parameters accepted by Cluster are:

index -- the index to use. Default is none.
clusterIndex -- the name of the cluster database index to use. Default is "clusterIndex".
clusterDBType -- the type of cluster database to use, either flatfile (simple cluster database) or keyfile (B-tree based).
clusterType -- the type of cluster to use, either agglomerative or centroid. The centroid type is agglomerative clustering using the mean, which trades memory use for speed of clustering. Default is centroid.
simType -- The similarity metric to use. Default is cosine similarity (COS), which is the only implemented method.
docMode -- The scoring method to use for the agglomerative cluster type. The default is max (maximum). The choices are:
- max -- Maximum score over documents in a cluster.
- mean -- Mean score over documents in a cluster. This is identical to the centroid cluster type.
- avg -- Average score over documents in a cluster.
- min -- Minimum score over documents in a cluster.
threshold -- Minimum score for adding a document to an existing cluster. Default is 0.25.
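
As a rough illustration of how docMode and threshold could interact, the following Python sketch scores a document against the members of one cluster. It is an assumption-laden reading of the parameter descriptions above, not the Lemur code: the function names cluster_score and accepts are hypothetical, max, avg, and min are taken to aggregate per-document cosine scores, and mean is taken to score against the mean vector of the cluster (consistent with the note that it matches the centroid cluster type).

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in members:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    count = len(members)
    return {term: weight / count for term, weight in total.items()}


def cluster_score(doc, members, doc_mode="max"):
    """Score a document against a cluster under an assumed docMode reading."""
    if not members:
        raise ValueError("cluster has no members")
    if doc_mode == "mean":
        # Assumption: 'mean' compares against the cluster's mean vector.
        return cosine(doc, mean_vector(members))
    scores = [cosine(doc, member) for member in members]
    if doc_mode == "max":
        return max(scores)
    if doc_mode == "min":
        return min(scores)
    if doc_mode == "avg":
        return sum(scores) / len(scores)
    raise ValueError("unknown docMode: " + doc_mode)


def accepts(doc, members, doc_mode="max", threshold=0.25):
    """True if the document scores at least the threshold for this cluster."""
    return cluster_score(doc, members, doc_mode) >= threshold


if __name__ == "__main__":
    cluster = [{"a": 1.0}, {"b": 1.0}]
    document = {"a": 1.0}
    for mode in ("max", "mean", "avg", "min"):
        print(mode, cluster_score(document, cluster, mode))
```

Under this reading, max is the most permissive choice (one similar member is enough to join a cluster) and min is the strictest (every member must be similar), so the same threshold yields fewer, larger clusters with max and more, tighter clusters with min.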

Generated on Tue Jun 15 11:02:58 2010 for Lemur by Doxygen 1.3.4