Abstract
This paper addresses the problem of establishing semantic similarity between documents of a news cluster and extracting key entities from an article's text. Existing methods and algorithms for fuzzy duplicate detection in texts are briefly reviewed and analysed, including TF-IDF and its modifications, Long Sent, Megashingles, Log Shingles, and Lex Rand. The essence of the shingles algorithm and its main stages are described in detail. Several options for parallel implementation of the shingles algorithm are presented: for multiprocessor heterogeneous computing systems using CUDA and OpenCL, and for distributed computing systems using Google App Engine. The parameters of the algorithm (running time, acceleration) are assessed as applied to the semantic analysis of news texts. In addition, methods and algorithms for extracting key phrases from news text are reviewed: graph methods, in particular TextRank, the construction of horizontal visibility graphs, the Viterbi algorithm, methods based on Markov random fields, as well as a comprehensive context-sensitive algorithm for news text analysis (a combination of statistical keyword extraction algorithms and algorithms that establish the semantic coherence of text blocks). These methods are analysed from the standpoint of their applicability to news article analysis. Particular attention is paid to the peculiarities of news text structure. Although thematic classification and extraction of key entities from text documents are powerful text processing tools, these stages of analysis cannot give a complete picture of the semantics of a news piece. The paper presents a methodology for comprehensive analysis of news text based on a combination of semantic analysis and subsequent abstracting of the text, presenting it in a compressed format, a so-called mind map.
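To make the approach concrete, the following is a minimal sketch of the general w-shingling scheme underlying the shingles algorithm mentioned above: the text is split into overlapping word n-grams, each shingle is hashed, and two documents are compared by the Jaccard similarity of their shingle sets. The helper names, the shingle width w=4, and the use of MD5 are illustrative assumptions, not the implementation evaluated in the paper.

```python
import hashlib
import re

def shingles(text, w=4):
    # Normalize naively (lowercase, word tokens) and form overlapping word w-grams.
    # The shingle width w=4 is an illustrative assumption, not the paper's setting.
    words = re.findall(r"\w+", text.lower())
    grams = [" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))]
    # Hash each shingle so that documents are compared as sets of fingerprints.
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def jaccard(a, b):
    # Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

if __name__ == "__main__":
    doc1 = ("The central bank raised interest rates by half a percentage point "
            "on Tuesday, officials said.")
    doc2 = ("The central bank raised interest rates by half a percentage point "
            "on Tuesday, officials said on Friday.")
    # Near-duplicate news texts share most of their shingles (here about 0.86),
    # while unrelated texts score near 0.
    print(jaccard(shingles(doc1), shingles(doc2)))
```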
Highlights
The problem of establishing semantic similarity between documents of a cluster and selecting the entities that make up the information structure of a text is one of the most important and difficult problems in web data analysis and information retrieval on the Internet.
Applications include improving the quality of search engine archives by removing redundant information, grouping news reports into stories based on the similarity of their content in semantic analysis tasks, spam filtering, and detecting copyright infringement through illicit copying of information.
Compared with the sequential code, the OpenCL version achieves an average acceleration of 10.12; unlike the CUDA variant, text normalization is not performed in parallel.
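The parallel variants rely on the observation that every document's shingles can be computed and hashed independently. Purely as an illustration of that data-parallel structure, the sketch below distributes shingling over CPU processes with Python's multiprocessing; it is not the paper's CUDA/OpenCL code, and the acceleration of 10.12 cited above refers to the GPU implementations, not to this sketch.

```python
# Rough illustration of the data-parallel idea behind the accelerated variants:
# each document can be shingled and hashed independently, so the work distributes
# naturally over many execution units. multiprocessing stands in here for the
# CUDA/OpenCL kernels evaluated in the paper.
import hashlib
import re
from multiprocessing import Pool

def hash_shingles(text, w=4):
    # Same w-shingling step as in the earlier sketch; w=4 remains an illustrative choice.
    words = re.findall(r"\w+", text.lower())
    return {hashlib.md5(" ".join(words[i:i + w]).encode("utf-8")).hexdigest()
            for i in range(max(len(words) - w + 1, 1))}

def shingle_corpus(corpus, workers=4):
    # Shingle a list of document strings in parallel, one task per document.
    with Pool(processes=workers) as pool:
        return pool.map(hash_shingles, corpus)

if __name__ == "__main__":
    docs = ["First news text about the topic.", "Second news text about the topic."]
    print([len(s) for s in shingle_corpus(docs)])
```

Note that in this sketch normalization (lowercasing and tokenization) runs inside each worker, whereas the highlight above states that in the OpenCL variant text normalization stays sequential.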
Summary
The problem of establishing semantic similarity between documents of a cluster and selecting the entities that make up the information structure of a text is one of the most important and difficult problems in web data analysis and information retrieval on the Internet. The urgency of this problem is determined by the variety of applications that require accounting for the semantic component of news documents. Efficient technologies for automated analysis of information provided in natural language are of particular interest both to many organizations (news feeds, information and library systems, etc.) and to individuals (network users). In this regard, it is necessary to study the structure of the news text, as well as methods for its analysis. Considerable attention is paid to developing methods that reduce the computational complexity of the created algorithms.