Abstract

A novel method is proposed for updating an already trained asymmetric or symmetric probabilistic latent semantic analysis (PLSA) model within the context of a varying document stream. The proposed method is coined online PLSA (oPLSA). The oPLSA employs a fixed-size moving window over a document stream to incorporate new documents and, at the same time, to discard old ones (i.e., documents that fall outside the scope of the window). In addition, the oPLSA assimilates new words that have not been previously seen (out-of-vocabulary words) and discards the words that appear exclusively in the documents to be thrown away. To handle the new words, Good-Turing estimates for the probabilities of unseen words are exploited. The experimental results demonstrate that the oPLSA outperforms, in terms of accuracy, well-known PLSA updating methods, such as the PLSA folding-in (PLSA fold.), the PLSA rerun from the breakpoint, the quasi-Bayes PLSA, and the incremental PLSA. A comparison with respect to CPU run time reveals that the oPLSA is the second fastest method after the PLSA fold.; however, the better accuracy of the oPLSA justifies its longer computation time. The oPLSA and the other PLSA updating methods, together with online LDA, are tested on document clustering, and F1 scores are also reported.
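The two mechanisms the abstract highlights, a fixed-size moving window over the document stream and a Good-Turing estimate for the probability mass of unseen words, can be illustrated with a minimal sketch. This is not the authors' implementation; the class name, window bookkeeping, and the simple N1/N form of the Good-Turing estimate are illustrative assumptions.

```python
from collections import Counter, deque


def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total probability mass reserved for
    unseen words: N1 / N, where N1 is the number of word types seen
    exactly once and N is the total token count."""
    n1 = sum(1 for c in counts.values() if c == 1)
    n = sum(counts.values())
    return n1 / n if n else 0.0


class SlidingWindowVocabulary:
    """Hypothetical sketch of the fixed-size moving window described in
    the abstract: each new document enters the window, the oldest one
    leaves, and words that appeared only in discarded documents are
    dropped from the vocabulary."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.counts = Counter()

    def add_document(self, tokens):
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]  # evicted automatically on append
            self.counts.subtract(oldest)
            # unary + removes zero/negative entries, i.e. words that
            # occurred only in the discarded document
            self.counts = +self.counts
        self.window.append(tokens)
        self.counts.update(tokens)

    def unseen_word_mass(self):
        return good_turing_unseen_mass(self.counts)
```

In an actual oPLSA update, the window shift would also trigger a re-estimation of the model parameters for the new vocabulary, with the Good-Turing mass distributed over the out-of-vocabulary words.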
