Abstract

A hierarchical scheme for clustering data is presented which applies to spaces with a high number of dimensions (). The data set is first reduced to a smaller set of partitions (multi-dimensional bins). Multiple clustering techniques are used, including spectral clustering; however, new techniques are also introduced based on the path length between partitions that are connected to one another. A Line-of-Sight algorithm is also developed for clustering. A test bank of 12 data sets with varying properties is used to expose the strengths and weaknesses of each technique. Finally, a robust clustering technique is discussed based on reaching a consensus among the multiple approaches, overcoming the weaknesses found individually.

Highlights

  • Clustering is a fundamental technique and methodology in data analysis and machine learning

  • All clustering techniques plus the four robust consensus results of each test case would be presented, leading to 360 figures, but due to space limitations, the full set of clustering results are provided in the supplemental material

  • Due to the number of variations in spectral clustering, the techniques are identified by an index given in Table 3, while in the text, a shorthand will be used: ([1] [2], NN1, 2DHIST) to represent the use of the 1st and 2nd eigenvectors, utilizing a Laplacian based on an adjacency matrix derived from the first nearest neighbor matrix NN1 and gathering the partitions into clusters within the eigenspace using a 2D histogram (6 × 6) bins

Read more

Summary

Introduction

Clustering is a fundamental technique and methodology in data analysis and machine learning. The explosion of the field of data science has, led to an expansion in how this notion is applied. In this respect, it would be more appropriate to refer to clustering as data organization, which would encompass the ideas of 1) data reduction, 2) data identification, 3) data clustering, and 4) data grouping. Data reduction is the process of converting raw data into a form that is more amenable for the application of a specific analytical and/or computational methodology. Data clustering is the process of associating data through proximity, similarity, or dissimilarity.

Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.