Development of a method for determining the keywords in the Slavic language texts based on the technology of web mining

Vasyl Lytvyn, Oksana Brodyak, Petro Pukach, Dmytro Ugryn, Victoria Vysotska

doi:10.15587/1729-4061.2017.98750

Abstract

The authors developed algorithmic support for content monitoring processes in order to determine the keywords of a Slavic language text using Web Mining technology. The peculiarities of applying this technology to determine the keywords and subject heading of text content are substantiated. Web Mining technology makes it possible to apply a text content monitoring method based on Porter's stemmer to the problem of determining keywords. The stemming modification relies on the well-known classification of the morphemic and word-formation structure of derivatives in the Ukrainian language, which reveals patterns of affix combination and models the structural organization of verbs and suffixed nouns. Algorithms of morphonological modification in verb inflection, adjective inflection, and word formation in the Ukrainian language were used. The method of determining keywords in text content was decomposed; its features include adaptation of the morphological and syntactic analysis of lexical units to the peculiarities of Ukrainian word and text structures. Algorithmic support for its main structural components was developed; its features include the convolution and analysis of nominal and verb groups and the construction of an analysis tree for each sentence, taking into account the structural features of sentences in Slavic language texts. A formal approach to stemming Ukrainian language text was proposed, aimed at the automatic detection of notional keywords in a Ukrainian text. Ways of improving the efficiency of keyword search, in particular of increasing keyword density in the text, were identified theoretically.
These ways are based on analyzing not the words themselves (nouns, sets of nouns, and adjectives with nouns; other parts of speech are ignored) but the word stems in Slavic language texts. The stem separation rules cover not only the removal of inflections but also of suffixes, and they account for letter alternation during the declension of nouns and adjectives. Using the developed software, the proposed content monitoring method was tested experimentally on Slavic language scientific texts in technical fields. For the selected experimental base of 100 works, the best results by the density criterion are achieved by analyzing each article without its compulsory initial information and without its list of literature. This is attained by training the system and by checking against refined blocked words and a refined thematic dictionary. Specifically, for the technical scientific texts of the experimental base, the best results are reached by analyzing the article without its beginning (title, authors, UDC, abstracts in two languages, author's keywords in two languages, authors' place of work) and without its list of literature, with a check against specified blocked words and a refined thematic dictionary. For this method the average keyword density in the text reaches 0.34, which is 81 % higher than the corresponding density of the original text, 0.19. Statistical analysis showed that tuning the system parameters increases the number of determined keywords almost 2 times without decreasing accuracy and reliability. Testing the proposed method on other categories of texts, such as scientific texts in the humanities, fiction, and journalism, requires further experimental research.
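
The idea of counting word stems rather than surface word forms can be illustrated with a minimal Python sketch. It is not the authors' implementation: the suffix list, the blocked-word list, and the names stem, tokenize, and keyword_density are hypothetical, the stemmer strips only a handful of inflectional endings (no suffix analysis or letter alternation), and density is assumed here to be the share of all word tokens whose stem belongs to the selected keywords.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated list of Ukrainian inflectional endings.
# The paper's rules also cover suffixes and letter alternation; this does not.
SUFFIXES = ("ами", "ями", "ого", "ому", "ою", "ий", "ій", "их", "им", "ів",
            "а", "я", "у", "ю", "и", "і", "е", "о")

# Hypothetical blocked (stop) words; a real list would be far longer.
BLOCKED = {"і", "та", "в", "на", "для", "що", "це"}


def stem(word, min_stem=3):
    """Strip the longest matching ending, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and return runs of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def keyword_density(text, top_n=3, min_count=2):
    """Return (keyword stems, density).

    Assumption: density = occurrences of keyword stems / all word tokens.
    """
    tokens = tokenize(text)
    if not tokens:
        return [], 0.0
    # Blocked words are excluded from the candidates but still count as words.
    counts = Counter(stem(t) for t in tokens if t not in BLOCKED)
    keywords = [s for s, c in counts.most_common(top_n) if c >= min_count]
    density = sum(counts[s] for s in keywords) / len(tokens)
    return keywords, density


if __name__ == "__main__":
    sample = "Система аналізує тексти. Системи обробляють текст і слова тексту."
    print(keyword_density(sample))
```

In this sketch the forms "тексти", "текст", and "тексту" collapse to one stem, so they are counted together; counting surface forms would treat them as three different words and lower the measured density.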

Highlights

Web Mining technology makes it possible to obtain valuable knowledge from the text of information resources on the Internet
The keywords in the text content of Web resources must be relevant to the keywords used by search engine users
Web Mining makes it possible to take into account the opinion of potential users and of the target audience in general when forming the content of a Web resource, optimizing it, and promoting it further to engage the potential audience

Summary

METHOD FOR DETERMINING

D. Ugryn, PhD, Associate Professor, Department of Information Systems, Chernivtsi Department, National Technical University "Kharkiv Polytechnic Institute", Holovna str., 203 A, Chernivtsi, Ukraine, 58000. Department of Mathematics, Lviv Polytechnic National University, S.

Introduction

Literature review and problem statement

Peculiarities of lexical analysis of the Slavic language texts

The aim and tasks of research

Method of determining the keywords in text content

Results of examining the keywords of text content

To analyze all the articles with the check of general

Findings

Conclusions

Journal: Eastern-European Journal of Enterprise Technologies	Publication Date: Apr 26, 2017
Citations: 19	License type: cc-by
