DEVELOPMENT OF A NATURAL LANGUAGE PROCESSING TOOL FOR SOLVING THE APPLICATION PROBLEM OF EXTRACTING STATISTICAL DATA FROM TEXT

doi:10.18469/ikt.2024.22.1.13

Abstract

Text analytics is used to explore textual content and derive new variables from raw text, which can serve as input data for forecasting models or other statistical methods, including for solving fundamental problems. The purpose of the research is to analyze machine learning algorithms and practical developments in this field, and to develop an integrated software tool for text processing whose algorithm is built on the BasicStats, ReadabilityStats, and SovChLit libraries and which extracts statistics from large volumes of raw Russian-language text. A method for extracting statistical data from large volumes of raw text, based on machine learning and natural language processing in Python, has been implemented and can be embedded in other projects. A software tool was developed that uses the functionality of the textary library adapted for the Russian language and works with both plain texts and Doc objects generated with the spaCy library. The study was conducted using real text data collected from «63.ru», an information and news portal for the Samara region (as part of the conceptual project «Data Farm» carried out by the artificial intelligence research laboratory). The developed software analyzes large volumes of text data and extracts useful information from them. It can be integrated into other software solutions as one of the linking modules in the code optimization chain for text data processing programs.
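
To make the kind of statistics discussed in the abstract concrete, the following is a minimal sketch of extracting basic counts from a raw Russian text using only the Python standard library. It is not the tool described in the paper and does not reproduce the API of BasicStats, ReadabilityStats, or spaCy; the function name `basic_text_stats`, the regular expressions, and the returned keys are assumptions chosen for illustration, and the sentence splitter is a naive punctuation-based heuristic.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; not part of any library named in the abstract.
# A word is a run of Unicode letters (Cyrillic included), optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# Naive sentence boundary: whitespace that follows terminal punctuation.
SENT_SPLIT_RE = re.compile(r"(?<=[.!?…])\s+")


def basic_text_stats(text: str) -> dict:
    """Return simple surface statistics for a raw text."""
    sentences = [s for s in SENT_SPLIT_RE.split(text.strip()) if s]
    words = [w.lower() for w in WORD_RE.findall(text)]
    n_sents = len(sentences)
    n_words = len(words)
    n_letters = sum(len(w) for w in words)
    return {
        "n_sents": n_sents,
        "n_words": n_words,
        "n_unique_words": len(set(words)),
        "n_letters": n_letters,
        # Guard against division by zero on empty input.
        "avg_word_len": n_letters / n_words if n_words else 0.0,
        "avg_sent_len": n_words / n_sents if n_sents else 0.0,
        "top_words": Counter(words).most_common(3),
    }


if __name__ == "__main__":
    sample = "Самара стоит на Волге. Волга впадает в Каспийское море!"
    for key, value in basic_text_stats(sample).items():
        print(f"{key}: {value}")
```

Values such as the average word and sentence length are the usual inputs to readability formulas, so a dictionary like this could be passed on as features to a forecasting model. For real news text, a proper tokenizer and sentence segmenter would be a more robust choice than these regular expressions.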
