Probabilistic Latent Semantic Analysis (PLSA) is a fundamental text analysis technique that models each word in a document as a sample from a mixture of topics. PLSA is the precursor of probabilistic topic models such as Latent Dirichlet Allocation (LDA). PLSA, LDA, and their numerous extensions have been successfully applied to many text mining and retrieval tasks. One important extension of LDA is supervised LDA (sLDA), which distinguishes itself from most topic models in that it is supervised. However, to the best of our knowledge, no prior work extends PLSA in the way sLDA extends LDA, namely by jointly modeling the contents and the responses of documents. In this paper, we propose supervised PLSA (sPLSA), which can efficiently infer latent topics and their factorized response values from the contents and the responses of documents. The major challenge lies in estimating a document's topic distribution, a constrained probability distribution governed by both the content and the response of the document. To tackle this challenge, we introduce an auxiliary variable that transforms the constrained optimization problem into an unconstrained one, which allows us to derive an efficient Expectation-Maximization (EM) algorithm for parameter estimation. Compared to sLDA, sPLSA converges much faster and requires less hyperparameter tuning, while performing comparably on topic modeling and better on response factorization. This makes sPLSA an appealing choice for latent response analysis, such as ranking latent topics by their factorized response values. We apply the proposed sPLSA model to analyze the controversy of bills from the United States Congress, and we demonstrate the effectiveness of our model by identifying contentious legislative issues.
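For context on the unsupervised model that sPLSA builds on, the following is a minimal sketch of plain PLSA fitted with EM on a document-term count matrix. It is not the paper's sPLSA algorithm: the abstract does not specify the auxiliary-variable reparameterization or how the response term enters the updates, so this sketch covers only the standard PLSA factorization P(w|d) = Σ_z P(z|d) P(w|z); the function name, iteration count, and initialization scheme are illustrative choices.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=100, seed=0, eps=1e-12):
    """Fit plain (unsupervised) PLSA with EM.

    counts: (n_docs, n_words) array of word counts n(d, w).
    Returns P(z|d) of shape (n_docs, n_topics) and P(w|z) of shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random initialization of P(z|d) and P(w|z), rows normalized to sum to 1.
    p_z_given_d = rng.random((n_docs, n_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    p_w_given_z = rng.random((n_topics, n_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) * P(w|z),
        # shape (n_docs, n_topics, n_words), normalized over topics.
        joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + eps

        # Expected counts n(d,w) * P(z|d,w).
        weighted = counts[:, None, :] * joint

        # M-step: re-estimate P(w|z) and P(z|d) from expected counts.
        p_w_given_z = weighted.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + eps
        p_z_given_d = weighted.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + eps

    return p_z_given_d, p_w_given_z
```

In the supervised setting described by the paper, the M-step for P(z|d) would additionally be constrained by the document's observed response, which is the constrained estimation problem the auxiliary-variable transformation is introduced to handle.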