Document Clustering
Contents
1. Overview
Document clustering is the act of collecting similar documents into bins, where similarity is some function on a document. The clustering algorithms implemented for LEMUR are described in "A Comparison of Document Clustering Techniques", Michael Steinbach, George Karypis and Vipin Kumar. TextMining Workshop. KDD. 2000. With the exception of Probabilistic Latent Semantic Analysis (PLSA), all use cosine similarity in the vector space model as their metric.
The LEMUR clustering support provides two principle APIs, the Cluster API, which defines the clusters themselves, and the ClusterDB API, which defines how Clusters are persistently stored. Similarity is scored via a SimilarityMethod object. Currently there is a single SimilarityMethod, CosSim, defined.
2. Applications
Cluster
Performs the basic online clustering task. In conjunction with an incremental indexer (such as using a KeyfileIncIndex), it could be used for the TDT topic detection task. It iterates over the documents in the index, assigning each document that is not in any cluster to a cluster. The document id, cluster id, and score are printed to the standard output. The parameters accepted by Cluster are:
- index -- the index to use. Default is none.
- clusterIndex -- the name of the cluster database index to use. Default is none.
- clusterdb_type -- One of flatfile (simple cluster database) or keyfile (btree based).
- clusterType -- Type of cluster to use, either agglomerative or centroid. Centroid is agglomerative using mean which trades memory use for speed of clustering. Default is centroid.
- simType -- The similarity metric to use. Default is cosine similarity (COS), which is the only implemented method.
- docMode -- The scoring method to use for the
agglomerative cluster type. The default is max (maximum).
The choices are:
- max -- Maximum score over documents in a cluster.
- mean -- Mean score over documents in a cluster. This is identical to the centroid cluster type.
- avg -- Average score over documents in a cluster.
- min -- Minimum score over documents in a cluster.
- threshold -- Minimum score for adding a document to an existing cluster. Default is 0.25.
OfflineCluster
The example application that demonstrates the basic offline clustering task. Provides k-means and bisecting k-means partitional clustering. It will run each algorithm on the first 100 documents in the index (or all of them if less than 100) and print out the results. The parameters accepted by OfflineCluster are:
- index -- the index to use. Default is none.
- clusterType -- Type of cluster to use, either agglomerative or centroid. Centroid is agglomerative using mean which trades memory use for speed of clustering. Default is centroid.
- simType -- The similarity metric to use. Default is cosine similarity (COS), which is the only implemented method.
- docMode -- The integer encoding of the scoring method to use for the
agglomerative cluster type. The default is max (maximum).
The choices are:
- max -- Maximum score over documents in a cluster.
- mean -- Mean score over documents in a cluster. This is identical to the centroid cluster type.
- avg -- Average score over documents in a cluster.
- min -- Minimum score over documents in a cluster.
- numParts -- Number of partitions to split into. Default is 2
- maxIters -- Maximum number of iterations for k-means. Default is 100.
- bkIters -- Number of k-means iterations for bisecting k-means. Default is 5.
PLSA
Perform Probabilistic Latent Semantic Analysis (PLSA) on a collection, building three probability tables: P(z), P(d|z), and P(w|z) where z ∈ Z are the latent variables (categories), d ∈ D are the documents in the collection, and w ∈ W are the terms in the vocabulary over the collection, or open those tables and read them into memory to illustrate their potential use. The implementation (the PLSA class) is based on the Java reference implementation from Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. "Methods and Metrics for Cold-Start Recommendations." in Proceedings of the 25'th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002). See http://www.cis.upenn.edu/datamining/software_dist/PennAspect/. Note that training takes a long time.
The parameter doTrain (true|false) determines whether the tables are constructed or read. The default value is true. The other parameters accepted by PLSA are:
- index -- the index to use. Default is none.
- numCats -- the number of latent variables (categories) to use. Default is 20.
- beta -- The value of beta for Tempered EM (TEM). Default is 1.
- betaMin -- The minimum value for beta, TEM iterations stop when beta falls below this value. Default is 0.6.
- eta -- Multiplier to scale beta before beginning a new set of TEM iterations. Must be less than 1. Default is 0.92.
- annealcue -- Minimum allowed difference between likelihood in consecutive iterations. If the difference is less than this, beta is updated. Default is 0.
- numIters -- Maximum number of iterations to perform. Default is 100.
- numRestarts -- Number of times to recompute with different random seeds. Default is 1.
- testPercentage -- Percentage of events (d,w) to hold out for validation.
- doTrain -- whether to construct the probability tables or read them in. Default is true.
3. Clustering API
Cluster
The cluster API provides an abstraction over a collection of cluster elements (ClusterElt), enabling the addition or removal of elements. It provides a score method for that uses the similarity between the object and another Cluster (see SimilarityMethod below). It also provides read and write methods for use by the ClusterDB. The two concrete subclasses of Cluster are AgglomCluster and CentroidCluster. Cluster instances are created via the ClusterFactory::allocateCluster method.
ClusterDB
The ClusterDB API provides for interactions with persistent collections of Cluster objects. There are two concrete implementors of this API. FlatfileClusterDB, which stores the Cluster objects in a flat file, similar to those used by the Inv(FP)Index class, and KeyfileClusterDB, which stores the Cluster objects in Keyfiles. ClusterDB implementors must provide methods to add or remove elements from a Cluster, merge two Clusters, or split a Cluster into multiple clusters. They must provide a way to retrieve a Cluster given its id number or given the document id of an element within the cluster. Finally they need to provide a Factory method for creating a new, empty Cluster.
SimilarityMethod
SimilarityMethod is an abstraction over comparing two ClusterRep (vector space representation) Cluster objects. The CosSim method is the only concrete implementation, it computes the cosine similarity. SimilarityMethods must provide two methods, one to weigh a vector (such as normalizing) and the similiarity function itself. SimilarityMethod objects are created by the SimFactory::makeSim method, using parameters defined in the ClusterParam namespace. To add a new SimilarityMethod, one needs to do the following:
- In ClusterParam.hpp add a symbol for the new method in the simTypes enum.
- In ClusterParam.hpp add an else if to test the simTypeString for equality with the new method type parameter.
- In SimFactory.hpp add an include statement for the new method's header file.
- In SimFactory.hpp add a case for the new method symbol to makeSim that makes an instance of the new method.
- Recompile your lemur library.
OfflineCluster
OfflineCluster provides k-means and bisecting k-means clustering of a set of documents, returning the k clusters in a vector. The clusters are not persistent. The Cluster class uses k-means to implement its split method.