This article discusses the main problems and principles of the data clustering process, in particular the principles and tasks of clustering text arrays of linguistic expert information. The work identifies the main difficulties that arise in designing such systems, for example the need to preprocess data and to reduce the size of the initial sample. To perform these tasks effectively, a solution must take an integrated approach: it should account for the efficiency of the methods used for individual subtasks and maintain high efficiency at each stage of the clustering process. Various groups of hierarchical clustering algorithms are considered, with a focus on the subgroup of agglomerative algorithms as applied to clustering linguistic expert information. A formal statement of the text clustering problem is given, and the main group of implemented solutions based on the principles of agglomerative clustering is identified: ROCK, CURE, and CHAMELEON. Each of these algorithms is reviewed in detail, and its main advantages and disadvantages are formulated. The value of this work lies in the combined body of data on the algorithms and in the results of a comparative analysis, which make it possible to assess the feasibility and potential of using these agglomerative clustering algorithms. The novelty of the work lies in an overview analysis of existing hierarchical clustering approaches to the cluster analysis of linguistic expert information, together with the results of the comparative analysis of the considered algorithms.