Abstract
This paper addresses the problem of establishing semantic similarity between documents of a news cluster and extracting key entities from an article's text. Existing methods and algorithms for fuzzy duplicate detection in texts are briefly reviewed and analysed, including TF-IDF and its modifications, Long Sent, Megashingles, Log Shingles, and Lex Rand. The essence of the shingles algorithm and its main stages are described in detail. Several options for parallel implementation of the shingles algorithm are presented: for multiprocessor heterogeneous computing systems using CUDA and OpenCL, and for distributed computing systems using Google App Engine. The parameters of the algorithm (running time, acceleration) are assessed as applied to the semantic analysis of news texts. In addition, methods and algorithms for extracting key phrases from news text are reviewed: graph methods, in particular TextRank, the construction of horizontal visibility graphs, the Viterbi algorithm, methods based on Markov random fields, as well as a comprehensive context-sensitive algorithm for news text analysis (a combination of statistical keyword extraction algorithms and algorithms that establish the semantic coherence of text blocks). These methods are analysed from the standpoint of their applicability to news article analysis. Particular attention is paid to the peculiarities of news text structure. Although thematic classification and extraction of key entities from text documents are powerful text processing tools, these stages of analysis cannot give a complete picture of the semantics of a news piece. The paper presents a methodology for comprehensive analysis of news text based on a combination of semantic analysis and subsequent abstracting of the text, presenting it in a compressed format, a so-called mind map.
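To make the approach concrete, the following is a minimal sketch of the general w-shingling scheme underlying the shingles algorithm mentioned above: the text is split into overlapping word n-grams, each shingle is hashed, and two documents are compared by the Jaccard similarity of their shingle sets. The helper names, the shingle width w=4, and the use of MD5 are illustrative assumptions, not the implementation evaluated in the paper.

```python
import hashlib
import re

def shingles(text, w=4):
    # Normalize naively (lowercase, word tokens) and form overlapping word w-grams.
    # The shingle width w=4 is an illustrative assumption, not the paper's setting.
    words = re.findall(r"\w+", text.lower())
    grams = [" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))]
    # Hash each shingle so that documents are compared as sets of fingerprints.
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def jaccard(a, b):
    # Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

if __name__ == "__main__":
    doc1 = ("The central bank raised interest rates by half a percentage point "
            "on Tuesday, officials said.")
    doc2 = ("The central bank raised interest rates by half a percentage point "
            "on Tuesday, officials said on Friday.")
    # Near-duplicate news texts share most of their shingles (here about 0.86),
    # while unrelated texts score near 0.
    print(jaccard(shingles(doc1), shingles(doc2)))
```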
Highlights
The problem of establishing semantic similarity between documents of a cluster and selecting the entities that make up the information structure of a text is one of the most important and difficult problems in web data analysis and information retrieval on the Internet.
Applications include improving the quality of search engine archives by removing redundant information, grouping news reports into stories based on the similarity of their content in semantic analysis tasks, spam filtering, and detecting copyright infringement through illicit copying of information.
Compared with the sequential code, the OpenCL version achieves an average acceleration of 10.12; unlike the CUDA variant, text normalization is not performed in parallel.
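The parallel variants rely on the observation that every document's shingles can be computed and hashed independently. Purely as an illustration of that data-parallel structure, the sketch below distributes shingling over CPU processes with Python's multiprocessing; it is not the paper's CUDA/OpenCL code, and the acceleration of 10.12 cited above refers to the GPU implementations, not to this sketch.

```python
# Rough illustration of the data-parallel idea behind the accelerated variants:
# each document can be shingled and hashed independently, so the work distributes
# naturally over many execution units. multiprocessing stands in here for the
# CUDA/OpenCL kernels evaluated in the paper.
import hashlib
import re
from multiprocessing import Pool

def hash_shingles(text, w=4):
    # Same w-shingling step as in the earlier sketch; w=4 remains an illustrative choice.
    words = re.findall(r"\w+", text.lower())
    return {hashlib.md5(" ".join(words[i:i + w]).encode("utf-8")).hexdigest()
            for i in range(max(len(words) - w + 1, 1))}

def shingle_corpus(corpus, workers=4):
    # Shingle a list of document strings in parallel, one task per document.
    with Pool(processes=workers) as pool:
        return pool.map(hash_shingles, corpus)

if __name__ == "__main__":
    docs = ["First news text about the topic.", "Second news text about the topic."]
    print([len(s) for s in shingle_corpus(docs)])
```

Note that in this sketch normalization (lowercasing and tokenization) runs inside each worker, whereas the highlight above states that in the OpenCL variant text normalization stays sequential.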
Summary
The problem of establishing semantic similarity between documents of a cluster and selecting the entities that make up the information structure of a text is one of the most important and difficult problems in web data analysis and information retrieval on the Internet. The urgency of this problem is determined by the variety of applications that require accounting for the semantic component of news documents. Efficient technologies for automated analysis of information provided in natural language are of particular interest both to many organizations (news feeds, information and library systems, etc.) and to individuals (network users). In this regard, it is necessary to study the structure of the news text, as well as methods for its analysis. Considerable attention is paid to developing methods that reduce the computational complexity of the created algorithms.