Abstract

In the field of text mining, topic modeling and detection are fundamental problems in public opinion monitoring, information retrieval, social media analysis, and other activities. Document clustering has been used for topic detection at the document level. Probabilistic topic models treat a topic as a distribution over the term space, but this approach overlooks the semantic information hidden in the topic. Thus, representing topics without loss of semantic information, as well as detecting the optimal topic, is a challenging task. In this study, we built topics using a network called a topic graph, in which topics were represented as concept nodes linked by semantic relationships drawn from WordNet. Next, we extracted each topic of the corpus from the topic graph by community discovery. To find the optimal topic for describing the related corpus, we defined a topic pruning process, which was used for topic detection. We then performed topic pruning using Markov decision processes, which transformed topic detection into a dynamic programming problem. Experimental results on a newsgroup corpus and a science literature corpus showed that our method obtained almost the same precision and recall as baseline models such as latent Dirichlet allocation and KeyGraph. In addition, our method performed better than the probabilistic topic model in terms of its explanatory power, its runtime was lower than that of all three baseline methods, and it can be optimized to fit a corpus better by using topic pruning.
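To make the pruning idea concrete, here is a minimal, illustrative sketch (not the paper's implementation) of topic pruning framed as a finite-horizon decision process solved by backward dynamic programming: each candidate concept in a topic is kept or pruned to maximize an assumed "coherence gain" minus a fixed retention penalty. The concept names, gain scores, and penalty are hypothetical; the paper's actual MDP presumably couples decisions through the topic graph's structure, which this toy version omits.

```python
# Toy sketch of topic pruning as dynamic programming (illustrative only).
# gain[c] and penalty are assumed, hypothetical quantities.

def prune_topic(concepts, gain, penalty=0.1):
    """Decide keep/prune for each candidate concept to maximize total reward.

    gain[c]  -- assumed coherence contribution of keeping concept c
    penalty  -- assumed fixed cost of keeping any concept (favors pruning)
    """
    n = len(concepts)
    # V[i] = best achievable reward from decision i onward (backward DP).
    V = [0.0] * (n + 1)
    keep = [False] * n
    for i in range(n - 1, -1, -1):
        keep_reward = gain[concepts[i]] - penalty + V[i + 1]
        prune_reward = V[i + 1]
        keep[i] = keep_reward > prune_reward
        V[i] = max(keep_reward, prune_reward)
    return [c for i, c in enumerate(concepts) if keep[i]], V[0]

# Example with made-up scores: concepts whose gain exceeds the penalty
# survive pruning, off-topic concepts are dropped.
concepts = ["network", "graph", "banana", "cluster"]
gain = {"network": 0.9, "graph": 0.7, "banana": 0.02, "cluster": 0.5}
pruned, value = prune_topic(concepts, gain)
# pruned -> ["network", "graph", "cluster"]
```

In this simplified form each decision is independent, so the DP is trivial; a state that tracks which neighbors in the topic graph have already been kept would make the recursion non-trivial and closer to a genuine MDP formulation.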
