Abstract

• We propose a Capsule Semantic Graph (CSG) to represent news documents.
• The CSG effectively captures the relationships between words and the semantics of news documents.
• We introduce a graph kernel to measure the similarity between CSGs.
• Our method better addresses the problems of news representation and similarity measurement.
• Our method is of great significance for topic detection from news.

Topic detection aims to discover valuable topics from massive volumes of online news. It helps people grasp what is happening in the real world and alleviates the burden of information overload, which matters all the more as online news grows explosively. Topic detection is typically transformed into a document clustering problem, whose core idea is to cluster news documents that report on the same topic into the same group based on document similarity. Because news documents are long and complexly structured, measuring their similarity is very challenging. Existing term-based methods represent a news document by a set of informative keywords in a vector space model (VSM) and then compute the relationship between documents by cosine similarity. However, VSM ignores the relationships between words and yields sparse semantics, which leads to low precision in topic detection. In recent years, probabilistic methods and graph analytical methods have been proposed for topic detection, but both have high time complexity. To cope with these problems, we first present a novel document representation approach based on graphical decomposition, which decomposes each news document into different semantic units and then constructs the relationships between these units to form a capsule semantic graph (CSG). Compared with the VSM representation, the CSG retains the relationships between words and alleviates the sparse semantics.
We next introduce a graph kernel to measure the similarity between CSGs based on their substructures. Finally, we use an incremental clustering method to cluster the news documents, in which the documents are represented by CSGs and the similarity between documents is calculated by the graph kernel. Experimental results on three standard datasets show that our method obtains higher precision, recall and F1 score than several state-of-the-art methods. Moreover, experimental results on a large news dataset show that our method, CSG-SM, has lower time complexity than the probabilistic methods and the graph analytical methods.
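
The three steps described above (graph representation, kernel similarity, incremental clustering) can be illustrated with a minimal sketch. It is not the authors' CSG construction or their graph kernel: as a loudly simplifying assumption, each document is reduced to a word co-occurrence graph, and the kernel is a cosine-normalized count of shared edges. The names `build_graph`, `edge_kernel` and `incremental_cluster`, and the `window` and `threshold` parameters, are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def build_graph(text, window=2):
    """Simplified stand-in for a CSG (assumption, not the paper's method).

    Nodes are lowercased tokens; an undirected edge links two different
    tokens that occur within `window` positions of each other.
    Returns a Counter mapping each edge to its occurrence count.
    """
    tokens = text.lower().split()
    edges = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return edges


def edge_kernel(g1, g2):
    """Cosine-normalized kernel over shared edges (one-edge substructures)."""
    dot = sum(weight * g2[edge] for edge, weight in g1.items())
    norm1 = math.sqrt(sum(w * w for w in g1.values()))
    norm2 = math.sqrt(sum(w * w for w in g2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def incremental_cluster(docs, threshold=0.3):
    """Single-pass clustering driven by the kernel similarity.

    Each document joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster. A cluster is
    summarized by the merged edge counts of its members.
    """
    clusters = []  # each entry: {"graph": Counter, "members": [doc indices]}
    for idx, text in enumerate(docs):
        graph = build_graph(text)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = edge_kernel(graph, cluster["graph"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            best["graph"].update(graph)  # add the new document's edge counts
        else:
            clusters.append({"graph": Counter(graph), "members": [idx]})
    return [cluster["members"] for cluster in clusters]
```

Because each document is compared only against the current cluster summaries, a single pass over the stream suffices. The threshold controls how readily new topics are opened, and it would need tuning on real data.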
