Abstract
A hierarchical scheme for clustering data is presented which applies to spaces with a high number of dimensions (). The data set is first reduced to a smaller set of partitions (multi-dimensional bins). Multiple clustering techniques are used, including spectral clustering; however, new techniques are also introduced based on the path length between partitions that are connected to one another. A Line-of-Sight algorithm is also developed for clustering. A test bank of 12 data sets with varying properties is used to expose the strengths and weaknesses of each technique. Finally, a robust clustering technique is discussed based on reaching a consensus among the multiple approaches, overcoming the weaknesses found individually.
Highlights
Clustering is a fundamental technique and methodology in data analysis and machine learning
All clustering techniques plus the four robust consensus results of each test case would be presented, leading to 360 figures, but due to space limitations, the full set of clustering results are provided in the supplemental material
Due to the number of variations in spectral clustering, the techniques are identified by an index given in Table 3, while in the text, a shorthand will be used: ([1] [2], NN1, 2DHIST) to represent the use of the 1st and 2nd eigenvectors, utilizing a Laplacian based on an adjacency matrix derived from the first nearest neighbor matrix NN1 and gathering the partitions into clusters within the eigenspace using a 2D histogram (6 × 6) bins
Summary
Clustering is a fundamental technique and methodology in data analysis and machine learning. The explosion of the field of data science has, led to an expansion in how this notion is applied. In this respect, it would be more appropriate to refer to clustering as data organization, which would encompass the ideas of 1) data reduction, 2) data identification, 3) data clustering, and 4) data grouping. Data reduction is the process of converting raw data into a form that is more amenable for the application of a specific analytical and/or computational methodology. Data clustering is the process of associating data through proximity, similarity, or dissimilarity.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: Journal of Data Analysis and Information Processing
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.