Preliminary data classification by multilevel segmentation of histograms for clustering of hypercubes

Roman Melnyk, Roman Kvit, Ruslan Tushnytskyy, Tetyana Salo

doi:10.15587/2706-5448.2020.220428

Abstract

The object of research is an algorithm for classifying large data sets based on hierarchical clustering. The nonlinear complexity of the clustering algorithm makes it impractical for data samples of 5–10 thousand objects and above, so the data must be pre-grouped before classification. The subject of research is therefore a data segmentation algorithm based on piecewise linear approximation.

The study uses hierarchical clustering algorithms, piecewise linear approximation of the cumulative histogram, which is calculated by a special procedure, and a procedure for searching for segmentation thresholds.

The computational complexity of the classical hierarchical algorithm reaches O(N^3), and certain steps that limit the search can reduce it to O(N^2), which is confirmed by experiments studying the dependence of the hierarchical tree on the initial sample. An approximate approach to key clustering with partitioning of a set of basic keys is implemented. To further reduce the complexity of the hierarchical clustering algorithm, a decomposition approach is proposed that splits the initial large data sample into a number of subsets. This preliminary decomposition makes the hierarchical clustering algorithm usable for big data classification. It is based on multilevel segmentation of cumulative or ordinary histograms obtained for every feature coordinate that characterizes a data object. The thresholds of multilevel segmentation are obtained by piecewise linear approximation of the histogram functions. The resulting hypercubes of data are taken as the objects for a three-stage clustering algorithm.

A powerful tool for data classification is obtained, which allows many experiments to be carried out with data of various types to identify patterns among the data features. Its application is intended for the processing of patient data, molecular structures, and economic problems in order to make optimal treatment decisions and to support diagnostics and modeling. With this approach, data classification can be performed online, yielding the results of direct analysis as data is received, for example, from spacecraft.
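The preliminary decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the "special procedure" for the cumulative histogram or the threshold search, so the sketch assumes an equal-width cumulative histogram and a greedy split at the point of largest deviation from the chord as a stand-in for the piecewise linear approximation. All function names and the parameters `bins`, `max_segments` and `tol` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def cumulative_histogram(values, bins):
    """Return (edges, cum): cum[i] is the fraction of values below edges[i]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    cum, running = [0.0], 0
    for c in counts:
        running += c
        cum.append(running / len(values))
    return edges, cum


def max_deviation(xs, ys, i, j):
    """Index and size of the largest vertical gap between the curve and chord i..j."""
    best_k, best_d = i, 0.0
    for k in range(i + 1, j):
        t = (xs[k] - xs[i]) / (xs[j] - xs[i])
        d = abs(ys[k] - (ys[i] + t * (ys[j] - ys[i])))
        if d > best_d:
            best_k, best_d = k, d
    return best_k, best_d


def segmentation_thresholds(xs, ys, max_segments, tol=1e-9):
    """Assumed greedy scheme: split the worst-fitting segment until the budget is used."""
    knots = [0, len(xs) - 1]
    while len(knots) - 1 < max_segments:
        k, d = max((max_deviation(xs, ys, a, b) for a, b in zip(knots, knots[1:])),
                   key=lambda kd: kd[1])
        if d <= tol:  # the curve is already (almost) piecewise linear
            break
        knots.append(k)
        knots.sort()
    return [xs[k] for k in knots[1:-1]]  # interior knots act as thresholds


def build_hypercubes(points, thresholds_per_feature):
    """Group points by the tuple of per-feature interval indices."""
    cubes = defaultdict(list)
    for p in points:
        key = tuple(bisect_right(th, v) for v, th in zip(p, thresholds_per_feature))
        cubes[key].append(p)
    return dict(cubes)


def segment_data(points, bins=10, max_segments=3):
    """Compute thresholds for every feature coordinate, then build the hypercubes."""
    thresholds = [
        segmentation_thresholds(*cumulative_histogram(feature, bins), max_segments)
        for feature in zip(*points)
    ]
    return thresholds, build_hypercubes(points, thresholds)


# Two well-separated groups of 2-D points (made-up illustrative data).
points = [(i / 10, i / 10) for i in range(10)] + [(9 + i / 10, 9 + i / 10) for i in range(10)]
thresholds, cubes = segment_data(points)
```

In this sketch each non-empty hypercube, rather than each individual point, would become an object passed on to the hierarchical clustering stage, which is what reduces the number of objects the costly algorithm has to handle.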
