Abstract
Topic modelling has been successfully applied in many text mining applications, such as natural language processing, information retrieval, and information filtering. In information filtering systems (IFs), user interest representation is the core component that determines the success of the system. Topics in a topic model generated from a user's documents can be used to represent the user's information interest. However, the quality of a topic model generated from a document collection is not always high, because its topics might contain meaningless or ambiguous words. This ambiguity problem can degrade the performance of IFs that use a topic model to represent user information interest. Hence, a topic evaluation method that assesses the quality of topics in a topic model is important for ensuring that the model can be used effectively in text mining applications. One method of measuring the quality of a topic model is to match the topical words of the model to concepts in an ontology. However, a limitation of this method is that some topical words in an examined topic cannot be found in the mapping ontology. In this study, we propose a new model that evaluates the quality of topics by matching topical words to concepts in an ontology. In particular, a word embedding technique is applied to deal with the ambiguity problem by finding similar concept words based on word embeddings. The assessed topics are then used in an information filtering system to filter relevant documents for a user. The proposed model was evaluated against state-of-the-art baseline models that use term-based, phrase-based, and topic-based user interest representations, and against several topic evaluation models. The evaluation results show that the proposed model outperforms the state-of-the-art baseline models.
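The idea of falling back to word embeddings when a topical word is missing from the ontology can be illustrated with a minimal sketch. This is not the SbTE model itself: the ontology, the three-dimensional vectors, the similarity threshold, and the coverage score below are all hypothetical values and names chosen only for illustration.

```python
import numpy as np

# Hypothetical toy ontology and toy 3-d word vectors (all values are made up).
ONTOLOGY = {"finance", "medicine", "sport"}
EMBEDDINGS = {
    "finance": np.array([1.0, 0.0, 0.0]),
    "medicine": np.array([0.0, 1.0, 0.0]),
    "sport": np.array([0.0, 0.0, 1.0]),
    "banking": np.array([0.9, 0.1, 0.0]),
    "doctor": np.array([0.1, 0.9, 0.1]),
    "blah": np.array([0.5, 0.5, 0.5]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def match_concept(word, threshold=0.8):
    """Return (concept, score); concept is None when nothing is close enough."""
    if word in ONTOLOGY:
        # Direct hit: the topical word is itself an ontology concept.
        return word, 1.0
    if word not in EMBEDDINGS:
        # No vector available, so no fallback is possible.
        return None, 0.0
    # Fallback: pick the ontology concept with the most similar embedding.
    best = max(ONTOLOGY, key=lambda c: cosine(EMBEDDINGS[word], EMBEDDINGS[c]))
    score = cosine(EMBEDDINGS[word], EMBEDDINGS[best])
    return (best, score) if score >= threshold else (None, score)


def topic_coverage(topic_words, threshold=0.8):
    """Fraction of topical words that map to some ontology concept (assumed score)."""
    if not topic_words:
        return 0.0
    matched = [w for w in topic_words if match_concept(w, threshold)[0] is not None]
    return len(matched) / len(topic_words)
```

With these toy vectors, "banking" is absent from the ontology but maps to "finance" through its embedding, whereas "blah" is not close to any concept and stays unmatched. In practice the vectors would come from a trained embedding model, and the threshold would need tuning.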
Highlights
The past decade has seen the rapid development of topic modelling for understanding text corpora.
This paper proposes a model, named Semantic based Topic Evaluation (SbTE), to evaluate the quality of topics generated from a document collection based on the semantics of the documents.
We evaluated the performance of topic evaluation by applying the assessed topics to document ranking in information filtering systems.
Summary
The past decade has seen the rapid development of topic modelling for understanding text corpora. Among the state-of-the-art models, Latent Dirichlet Allocation (LDA) [1]–[3] is the most popular technique, and it provides an explicit representation of documents. In LDA, each document is represented by a probability distribution over topics, and each topic is a probability distribution over words. Topic-model-based document representation has been successfully applied in many text mining applications. However, the topics generated by LDA still have limitations. Ambiguous or meaningless topical words and topics were reported in [4] as a common limitation of topic models in general, with many topical words being ambiguous or noisy.
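The two-level representation used by LDA can be made concrete with a small numerical sketch. The vocabulary and probabilities below are invented for illustration and are not taken from the paper; the sketch only shows how a document's word distribution follows from its topic mixture, p(w | d) = Σ_k p(w | z_k) p(z_k | d).

```python
import numpy as np

# Hypothetical four-word vocabulary and two topics (all numbers are made up).
vocab = ["bank", "loan", "match", "goal"]

# topic_word[k, w] = p(word w | topic k); each row sums to 1.
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],  # a finance-like topic
    [0.05, 0.05, 0.40, 0.50],  # a sport-like topic
])

# doc_topic[k] = p(topic k | document); sums to 1.
doc_topic = np.array([0.75, 0.25])


def word_distribution(doc_topic, topic_word):
    """Mix the topic-word distributions using the document's topic weights."""
    return doc_topic @ topic_word


def top_words(topic_row, vocab, n=2):
    """Return the n most probable words of one topic."""
    order = np.argsort(topic_row)[::-1][:n]
    return [vocab[i] for i in order]


p_w_given_d = word_distribution(doc_topic, topic_word)
```

Here `top_words` lists the highest-probability words of a topic, which are the topical words whose quality a topic evaluation method would assess.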