Accuracy of separable nonnegative matrix factorization for topic extraction

Hendri Murfi

doi:10.1145/3162957.3162996

Abstract

Topic extraction is an automatic method for extracting topics from textual data. A popular method of topic extraction is latent Dirichlet allocation (LDA), which is a probabilistic topic model. Because learning its model parameters has some limitations, e.g. the problem is NP-hard, several researchers have continued the work of designing methods with polynomial complexity. One developing alternative is the nonnegative matrix factorization (NMF) based method. Under a separability assumption, a direct method that runs in polynomial time has been proposed. In general, this algorithm works in three steps: first, it generates a word co-occurrence matrix; second, it chooses anchor words for each topic; and third, in the recovery step, it directly reconstructs the topics given the anchor words. In this paper, we examine the accuracy of separable nonnegative matrix factorization (SNMF). First, the accuracy of SNMF is strongly influenced by the anchor words: it improves significantly when we find the anchor words in eigenspace instead of random space. Moreover, SNMF gives higher accuracy than LDA but lower accuracy than NMF.
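The three steps described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. It assumes a small documents-by-words count matrix in which every word occurs at least once, uses a greedy successive-projection rule for anchor selection, and swaps in clipped least squares for the constrained recovery step. The projection into a random space or eigenspace is omitted. All function names and the toy data are hypothetical.

```python
import numpy as np

def cooccurrence(X):
    """Step 1: word-word co-occurrence matrix from a documents-by-words count matrix."""
    X = np.asarray(X, dtype=float)
    Q = X.T @ X
    return Q / Q.sum()  # normalise so all entries sum to 1

def row_normalize(Q):
    """Scale each row to sum to 1 (assumes every word occurs at least once)."""
    return Q / Q.sum(axis=1, keepdims=True)

def find_anchors(Q_bar, k):
    """Step 2: greedy anchor selection (assumed successive-projection rule).

    Repeatedly pick the row with the largest residual norm, then project
    that direction out of all rows. Assumes k does not exceed the rank of Q_bar.
    """
    R = Q_bar.copy()
    anchors = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))
        anchors.append(i)
        u = R[i] / norms[i]
        R = R - np.outer(R @ u, u)
    return anchors

def recover_topics(Q, anchors):
    """Step 3: reconstruct p(word | topic) given the anchor words.

    Simplification: unconstrained least squares followed by clipping and
    renormalisation, in place of a properly constrained solver.
    """
    p_w = Q.sum(axis=1)                    # marginal word probabilities
    Q_bar = row_normalize(Q)
    A = Q_bar[anchors]                     # k x V matrix of anchor rows
    C, *_ = np.linalg.lstsq(A.T, Q_bar.T, rcond=None)
    C = np.clip(C.T, 0.0, None)            # V x k, approximates p(topic | word)
    C = C / np.maximum(C.sum(axis=1, keepdims=True), 1e-12)
    joint = C * p_w[:, None]               # p(word, topic) by Bayes' rule
    return joint / joint.sum(axis=0, keepdims=True)

# Toy corpus: words 0-2 appear only in the first two documents,
# words 3-5 only in the last two.
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 1, 2],
])
Q = cooccurrence(X)
anchors = find_anchors(row_normalize(Q), k=2)
topics = recover_topics(Q, anchors)        # shape: (words, topics)
```

In this toy corpus the two word groups never co-occur, so one anchor is selected from each group and each recovered topic places its probability mass on a single group.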

Full Text