Abstract

We address the development of algorithmic software for content monitoring, aimed at recognizing the style of the author of a Ukrainian text using Web Mining and NLP technologies. The method for recognizing an author's style, based on analysis of the stop words found in the text, is decomposed into its component steps. A specific feature of the method is the adaptation of morphological and syntactic analysis of lexical units to the structural peculiarities of Ukrainian words and texts. Syntactic words (stop words, or anchor words) are the ones that matter for an author's individual style, because they are not tied to the theme and content of the publication. Recognition of the author's style is based on analysis of the coefficients of the author's lexical language: coherence of speech, lexical diversity, syntactic complexity, and indices of concentration and exclusivity for the author's fragment. These coefficients are then compared to determine the degree to which the analyzed text belongs to a particular author. We studied the internal dynamics of texts by randomly selected authors by analyzing these coefficients for the first k, n, and m words (excluding the title) of the author's fragment and of the analyzed fragment, and compared the results. We also report experimental testing of the proposed content-monitoring method for identifying and analyzing stop words in Ukrainian scientific texts in technical fields, based on Web Mining technology. For the selected experimental base of 100 works, analyzing an article without its compulsory initial information and list of references gives the best results by the density criterion. This is achieved by training the system and by checking specified blocked words and specified thematic vocabulary.
Testing the proposed method for identifying keywords in other categories of texts (scientific texts in the humanities, belles-lettres, journalism, etc.) requires further experimental research.
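
To make the comparison step more concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not give the formulas for the coefficients, so everything here is an assumption made for illustration: the tiny `STOP_WORDS` list, the use of the type-token ratio as "lexical diversity", the relative-frequency stop-word profile over the first k words, and the absolute-difference distance between two profiles.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"і", "та", "в", "на", "що", "не", "з", "до", "для", "це"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, inner apostrophes)."""
    return re.findall(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*", text.lower())


def stop_word_profile(tokens, k=None):
    """Relative frequency of each stop word within the first k tokens (all if k is None)."""
    window = tokens[:k] if k else tokens
    counts = Counter(t for t in window if t in STOP_WORDS)
    total = len(window) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in STOP_WORDS}


def lexical_diversity(tokens):
    """Type-token ratio: distinct words / all words (an assumed definition)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def profile_distance(p, q):
    """Sum of absolute differences between two stop-word profiles (assumed measure)."""
    return sum(abs(p[w] - q[w]) for w in STOP_WORDS)
```

Under these assumptions, a smaller `profile_distance` between the reference fragment of an author and the analyzed fragment would suggest a closer stylistic match; calling `stop_word_profile` with different values of `k` gives a rough view of how the profile changes along the text.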

Highlights

  • The impetus for research into statistical linguistics was the emergence and active development of information technologies (IT) in the area of NLP and Web Mining [1]

  • To study the author’s style in Ukrainian texts using the technology of statistical linguistics, scientific publications from two issues (783 and 805) of the Visnyk of the National University “Lviv Polytechnika”, series “Information systems and networks”, were analyzed

  • We developed a method for recognizing the style of a text’s author based on the coefficients of the author’s lexical language in a reference fragment of the author’s text

Summary

Introduction

The impetus for research into statistical linguistics (quantitative linguistics) was the emergence and active development of information technologies (IT) in the area of NLP and Web Mining [1]. Potebnya of the Academy of Science of the USSR, a group of structural and mathematical linguistics was organized [2]. It began straightforward statistical research into Ukrainian texts of the belles-lettres, scientific-technical, and socio-political functional styles. The major trend in applied statistical linguistics and related sciences is the development of methods and technologies for determining the statistical structure of a text in order to solve problems of, in particular, linguometry [4], stylemetry [5], and glottochronology [6]. These problems include, for example, automation of lexicographic processes, comparison of dictionaries, creation of shorthand systems, and automatic recognition of a language [7]. For Ukrainian texts, it was found that the statistical parameters of styles include the frequencies of vowels, consonants, and spaces between words, as well as of palatalized and resonant groups of consonants.
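
As a rough illustration of such character-level statistics, the sketch below counts the share of vowel letters, consonant letters, and spaces in a Ukrainian string. It is only an assumed example: the letter sets, the function name `char_class_frequencies`, and the choice to ignore the soft sign, apostrophes, digits, and punctuation are illustrative decisions, not taken from the cited studies.

```python
# Ukrainian vowel and consonant letters (the soft sign "ь" is deliberately left out).
VOWELS = set("аеєиіїоуюя")
CONSONANTS = set("бвгґджзйклмнпрстфхцчшщ")


def char_class_frequencies(text):
    """Share of vowel letters, consonant letters, and spaces among the counted characters."""
    counted = [c for c in text.lower() if c in VOWELS or c in CONSONANTS or c == " "]
    total = len(counted) or 1  # avoid division by zero on empty input
    return {
        "vowels": sum(c in VOWELS for c in counted) / total,
        "consonants": sum(c in CONSONANTS for c in counted) / total,
        "spaces": sum(c == " " for c in counted) / total,
    }
```

Computed over texts of different functional styles, such shares could serve as simple style parameters of the kind mentioned above.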

Literature review and problem statement
The goal and objectives of research
Method of determining style of the text content’s author
Findings
Conclusions