METHODS AND ALGORITHMS FOR TEXT DATA CLUSTERING (REVIEW)

V. V. Bova, S. I. Rodzin, Y. A. Kravchenko

doi:10.18522/2311-3103-2022-4-122-143

Abstract

The article addresses one of the important tasks of artificial intelligence: machine processing of natural language. Solving this task with cluster analysis makes it possible to identify, formalize, and integrate large amounts of linguistic expert information under conditions of information uncertainty and weakly structured source text resources drawn from various subject areas. Cluster analysis is a powerful tool for exploratory analysis of text data; it allows an objective classification of any objects that are characterized by a number of features and contain hidden patterns. The paper reviews and analyzes modern modified algorithms used at various stages of text data clustering: the agglomerative clustering algorithms CURE, ROCK, and CHAMELEON, the non-hierarchical clustering algorithms PAM and CLARA, and the affine transformation algorithm. Their effectiveness is verified by experimental studies. The paper substantiates the requirements for choosing the most efficient clustering method for improving the intelligent processing of linguistic expert information. It also considers methods for visualizing clustering results in order to interpret the cluster structure and dependencies in a set of text data elements, along with graphical means of presenting them: dendrograms, scatterplots, VOS similarity diagrams, and intensity maps. To compare the quality of the algorithms, internal and external performance metrics were used: V-measure, Adjusted Rand index, and Silhouette. The experiments showed that a hybrid approach is needed when no hypothesis about the initial number of clusters can be put forward. In this approach, the number of clusters and the placement of their centers are first selected with a hierarchical method that sequentially merges and averages the characteristics of the closest data points in a limited sample.
Iterative clustering algorithms, which are highly stable with respect to noise features and outliers, are then applied. Hybridization increases the efficiency of clustering algorithms. The results also showed that, to increase computational efficiency and overcome sensitivity to the initialization of clustering algorithm parameters, metaheuristic approaches should be used to optimize the parameters of the learning model and search for a globally optimal solution.
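The hybrid scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that documents have already been converted to numeric vectors (here, synthetic 2-D points stand in for them), that the hierarchical stage uses centroid linkage with a distance threshold as its stopping rule, and that the iterative stage is a plain k-means refinement. The names `initial_centers`, `refine`, and `merge_threshold` are hypothetical and chosen only for this example.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def initial_centers(sample, merge_threshold):
    """Hierarchical stage (assumed centroid linkage): repeatedly merge the two
    closest clusters of the sample and average them, stopping once the closest
    pair is farther apart than merge_threshold (an assumed stopping rule)."""
    clusters = [[p] for p in sample]
    while len(clusters) > 1:
        centers = [centroid(c) for c in clusters]
        d, i, j = min(
            (math.dist(centers[i], centers[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > merge_threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]


def refine(points, centers, max_iter=50):
    """Iterative stage (assumed k-means): assign every point to its nearest
    center, recompute the centers, and stop when they no longer change."""
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centers.append(centroid(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Synthetic stand-in for document vectors: three well-separated groups.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points, truth = [], []
for b, (cx, cy) in enumerate(blob_centers):
    for _ in range(50):
        points.append((rng.gauss(cx, 0.2), rng.gauss(cy, 0.2)))
        truth.append(b)

# Stage 1: estimate the number of clusters and their centers on a limited sample.
sample = rng.sample(points, 30)
centers = initial_centers(sample, merge_threshold=2.5)

# Stage 2: refine on the full data set, starting from the estimated centers.
centers, labels = refine(points, centers)
print(len(centers), "clusters found")
```

The hierarchical stage runs only on the sample because its cost grows quickly with the number of points, while the iterative stage scales to the full data set once it has reasonable starting centers.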

Published in: IZVESTIYA SFedU. ENGINEERING SCIENCES
