Fusion News Elements of News Text Similarity Calculation

Hongbin Wang, Zhongxi Hou, Liguo Fan, Jingzhen Ye

doi:10.1007/978-3-030-00214-5_66

Abstract

Text similarity is an effective metric for estimating the degree to which two or more texts match. The Vector Space Model (VSM) is used for text similarity calculation in most cases. However, it is ill-suited to complex tasks because of its high dimensionality and computational complexity. For news, calculating the similarity of two texts is crucial for ascertaining whether two reports describe the identical event or the same type of information. An analysis of news reports shows that five basic elements, namely "when", "where", "what", "why", and "who", distinguish one news report from another. Based on these features, this study proposes a method for calculating the similarity of news texts. The proposed method fully integrates the influence of the five kinds of news feature words into the evaluation of text similarity, which largely avoids interference from irrelevant text and the loss of computational efficiency. The method consists of four steps: extraction of the news elements, classification of these elements, calculation of the similarity, and comparison with existing methods. Experimental results suggest that the proposed method outperforms the vector space cosine coefficient method, the Jaccard coefficient method, and the entropy method in terms of time complexity and accuracy.
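The idea of comparing news reports through their five elements rather than through full term vectors can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so everything below is an assumption for illustration only: each report is taken to be already reduced to a set of words per element, each element pair is scored with a plain Jaccard coefficient, and the scores are combined with hypothetical equal weights. The names `ELEMENTS`, `jaccard`, and `element_similarity` are not from the paper.

```python
from typing import Dict, Optional, Set

# The five news elements named in the abstract.
ELEMENTS = ("when", "where", "what", "why", "who")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient of two word sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def element_similarity(
    doc_a: Dict[str, Set[str]],
    doc_b: Dict[str, Set[str]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of per-element Jaccard scores.

    Assumption: equal weights by default. Elements that are empty in
    both reports are skipped and the remaining weights renormalized.
    """
    if weights is None:
        weights = {e: 1.0 / len(ELEMENTS) for e in ELEMENTS}

    score = 0.0
    weight_sum = 0.0
    for element in ELEMENTS:
        words_a = doc_a.get(element, set())
        words_b = doc_b.get(element, set())
        if not words_a and not words_b:
            continue  # nothing extracted for this element in either report
        score += weights[element] * jaccard(words_a, words_b)
        weight_sum += weights[element]

    return score / weight_sum if weight_sum > 0 else 0.0


# Hypothetical, hand-written element sets for two short reports.
report_a = {
    "when": {"monday"},
    "where": {"kunming"},
    "what": {"earthquake"},
    "who": {"rescuers"},
}
report_b = {
    "when": {"monday"},
    "where": {"kunming"},
    "what": {"earthquake", "aftershock"},
    "who": {"residents"},
}

print(element_similarity(report_a, report_b))
```

Because each report is compared over a handful of small word sets instead of a vocabulary-sized vector, the comparison stays cheap and words outside the five elements cannot affect the score, which is the intuition behind the efficiency and interference claims above.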

Full Text