Abstract

Cluster analysis in spectroscopy presents unique challenges because spectroscopic data are typically high-dimensional with small sample sizes. To improve cluster analysis outcomes, feature selection can be used to remove redundant or irrelevant features and reduce the dimensionality. However, for cluster analysis, this must be done in an unsupervised manner without the benefit of data labels. This paper presents a novel feature selection approach for cluster analysis that uses clusterability metrics to remove the features that contribute least to a dataset's tendency to cluster. Two versions are presented and evaluated: the Hopkins clusterability filter, which uses the Hopkins test for spatial randomness, and the Dip clusterability filter, which uses the Dip test for unimodality. These new techniques, along with a range of existing filter and wrapper feature selection techniques, were evaluated on eleven real-world spectroscopy datasets using internal and external clustering indices. Our newly proposed Hopkins clusterability filter performed the best of the six filter techniques evaluated. However, results varied greatly between techniques depending on the specifics of the dataset and the number of features selected, with significant instability observed for most techniques at low numbers of features. The genetic algorithm wrapper technique avoided this instability, performed consistently across all datasets, and gave better results on average than using all the features in the spectra.
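
To make the clusterability idea concrete, the following is a minimal sketch of the Hopkins statistic in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names, the sample size `m`, the bounding-box sampling window, and the `power` argument are all hypothetical choices. The classical definition raises distances to the power of the dimensionality; `power` defaults to 1 here, a common simplification that avoids numerical underflow for high-dimensional spectra.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Euclidean distance from `point` to its nearest neighbour in `data`."""
    best = math.inf
    for j, other in enumerate(data):
        if j == skip_index:  # do not compare a data point with itself
            continue
        best = min(best, math.dist(point, other))
    return best


def hopkins(data, m=None, power=1, seed=0):
    """Hopkins statistic: about 0.5 for uniform noise, near 1 for clustered data.

    `m`, `power`, and `seed` are illustrative parameters (assumptions).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    m = m or max(1, n // 10)

    # Bounding box of the data, used as the sampling window.
    lows = [min(row[k] for row in data) for k in range(d)]
    highs = [max(row[k] for row in data) for k in range(d)]

    # w: nearest-neighbour distances for m real points (sampled without replacement).
    sample_idx = rng.sample(range(n), m)
    w = [nearest_distance(data[i], data, skip_index=i) for i in sample_idx]

    # u: distances from m uniformly generated points to the nearest real point.
    u = []
    for _ in range(m):
        p = [rng.uniform(lows[k], highs[k]) for k in range(d)]
        u.append(nearest_distance(p, data))

    su = sum(x ** power for x in u)
    sw = sum(x ** power for x in w)
    if su + sw == 0:  # degenerate case: all points identical
        return 0.5
    return su / (su + sw)
```

A clusterability filter in the spirit described above could, for example, recompute this statistic with each feature left out and discard the feature whose removal leaves the statistic highest; that ranking rule is an assumption here, since the abstract does not specify the exact procedure.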

Highlights

  • Cluster analysis is an unsupervised machine learning technique aimed at generating knowledge from unlabeled data [1]

  • The feature selection techniques from the filter methods, wrapper methods, and the newly proposed clusterability filter methods were applied to the explosives spectroscopy datasets and the public spectroscopy datasets

  • A positive score showed that the feature selection technique resulted in an improved Silhouette index (SI) score compared to applying no feature selection
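
As an illustration of the score described in the last highlight, the sketch below computes the Silhouette index in plain Python and reports the change relative to using all features. It assumes the score is a simple difference (SI with feature selection minus SI without); the paper may normalize it differently. The function names are hypothetical, and the cluster labels are assumed to come from whatever clustering algorithm is run in each feature space.

```python
import math
from collections import defaultdict


def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = defaultdict(list)
    for i, lab in enumerate(labels):
        clusters[lab].append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def si_change(points_all, labels_all, points_selected, labels_selected):
    """Positive when feature selection improved the Silhouette index (assumed definition)."""
    return silhouette_index(points_selected, labels_selected) - silhouette_index(points_all, labels_all)


# Toy data: feature 0 separates two groups, feature 1 is noise.
points = [[0.0, 0.0], [0.1, 5.0], [0.2, -5.0], [10.0, 0.0], [10.1, 5.0], [10.2, -5.0]]
labels = [0, 0, 0, 1, 1, 1]
selected = [[row[0]] for row in points]  # keep only feature 0
print(si_change(points, labels, selected, labels) > 0)
```

In this toy case, dropping the noisy feature tightens both clusters, so the change in SI is positive.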


Summary

Introduction

1.1 Cluster Analysis in Spectroscopy

Cluster analysis is an unsupervised machine learning technique aimed at generating knowledge from unlabeled data [1]. While cluster analysis is commonly used for data exploration, it is also valuable in other circumstances, such as when the class structure is known to vary with time or when the cost of acquiring classified (labeled) samples is too great [3]. Much of the feature selection literature from the chemometric domain focuses on classification and regression, and the associated calibration, using well-proven techniques such as partial least squares (PLS) and principal component regression (PCR) [8,9,10]. It has not been demonstrated whether the feature selection methods associated with these techniques are applicable to cluster analysis.
