Probabilistic Latent Semantic Analysis

This application runs in one of two modes. In training mode it performs PLSA on a collection and builds three probability tables: P(z), P(d|z), and P(w|z), where z in Z are the latent variables (categories), d in D are the documents in the collection, and w in W are the terms in the collection's vocabulary. Otherwise it opens previously built tables and reads them into memory to illustrate their potential use.
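As an illustration of how the three tables can be used together, the sketch below applies the standard PLSA decomposition P(d,w) = sum over z of P(z) P(d|z) P(w|z). It is not Lemur code: the table values, the list-of-lists layout, and the function names joint_probability and category_posterior are assumptions made for this example only.

```python
# Toy tables for 2 categories, 2 documents, and 3 terms.
# All values are made up for illustration; real tables come from training.
p_z = [0.5, 0.5]                    # P(z)
p_d_given_z = [[0.9, 0.1],          # P(d|z), one row per category z
               [0.2, 0.8]]
p_w_given_z = [[0.7, 0.2, 0.1],     # P(w|z), one row per category z
               [0.1, 0.3, 0.6]]


def joint_probability(d, w):
    """P(d, w) = sum over z of P(z) * P(d|z) * P(w|z)."""
    return sum(p_z[z] * p_d_given_z[z][d] * p_w_given_z[z][w]
               for z in range(len(p_z)))


def category_posterior(d, w):
    """P(z | d, w): each category's share of the joint probability."""
    terms = [p_z[z] * p_d_given_z[z][d] * p_w_given_z[z][w]
             for z in range(len(p_z))]
    total = sum(terms)
    return [t / total for t in terms]


print(joint_probability(0, 0))
print(category_posterior(0, 0))
```

The posterior shows which latent category best explains a given (document, term) event, which is one way the tables might be put to use once they are in memory.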

The parameter doTrain (true|false) determines whether the tables are constructed or read. The default value is true.

The parameters accepted by PLSA are:

index -- the index to use. Default is none.
numCats -- the number of latent variables (categories) to use. Default is 20.
beta -- the initial value of beta for Tempered EM (TEM). Default is 1.
betaMin -- the minimum value for beta. TEM iterations stop when beta falls below this value. Default is 0.6.
eta -- multiplier used to scale beta before beginning a new set of TEM iterations. Must be less than 1. Default is 0.92.
annealcue -- minimum allowed difference in likelihood between consecutive iterations. If the difference is less than this, beta is updated (scaled by eta). Default is 0.
numIters -- maximum number of iterations to perform. Default is 100.
numRestarts -- number of times to recompute with different random seeds. Default is 1.
testPercentage -- percentage of events (d,w) to hold out for validation.
doTrain -- whether to construct the probability tables or read them in. Default is true.
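
The interaction of beta, betaMin, eta, annealcue, and numIters can be sketched as follows. This is one reading of the parameter descriptions above, not the Lemur implementation: the control flow is an assumption, and the likelihood argument is a hypothetical callback standing in for one EM step plus a likelihood evaluation.

```python
def tempered_em_schedule(likelihood, beta=1.0, beta_min=0.6, eta=0.92,
                         annealcue=0.0, num_iters=100):
    """Return the value of beta used at each iteration.

    likelihood(iteration, beta) is a hypothetical stand-in for running one
    EM iteration at the given beta and returning the resulting likelihood.
    """
    betas = []
    previous = None
    for iteration in range(num_iters):
        if beta < beta_min:
            break                      # beta fell below betaMin: stop
        betas.append(beta)
        current = likelihood(iteration, beta)
        # If the likelihood improved by less than annealcue, scale beta by eta.
        if previous is not None and current - previous < annealcue:
            beta *= eta
        previous = current
    return betas


# A likelihood that gets worse every iteration forces beta down each time.
worsening = tempered_em_schedule(lambda i, b: -float(i))
# A likelihood that keeps improving never triggers an update of beta.
improving = tempered_em_schedule(lambda i, b: float(i))
print(len(worsening), len(improving))
```

With the default annealcue of 0, this sketch lowers beta only when the likelihood decreases from one iteration to the next. A larger annealcue would lower beta sooner, as soon as the improvement becomes small.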

Generated on Tue Jun 15 11:02:58 2010 for Lemur by Doxygen 1.3.4