Abstract

In real-world applications, identifying groups is challenging because the data may contain both outliers and noise variables. There is therefore a need for a clustering method that can reveal the group structure in such data without any prior knowledge. In this paper, we propose a k-means-based algorithm incorporating a weighting function that automatically assigns a weight to each observation. To cope with noise variables, a lasso-type penalty is applied in an objective function adjusted by the observation weights. Finally, we introduce a framework for selecting both the number of clusters and the variables based on a modified gap statistic. Experiments on simulated and real-world data demonstrate the advantage of the method in identifying groups, outliers, and informative variables simultaneously.

Highlights

  • The identification of groups in real-world high-dimensional datasets is challenging for several reasons: (1) the presence of outliers; (2) the presence of noise variables; (3) the selection of proper parameters for the clustering procedure, e.g. the number of clusters

  • We evaluate the performance of the proposed method in terms of the clustering solution, outlier detection, and the identification of informative variables

  • The clustering solution is evaluated based on the Classification Error Rate (CER), used by Witten and Tibshirani (2010)
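
As an illustration of the CER mentioned above, the following is a minimal sketch, assuming the pairwise definition in which the CER is the fraction of observation pairs on which two partitions disagree about co-membership (equivalently, one minus the Rand index). The function name and arguments are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def classification_error_rate(labels_a, labels_b):
    """Fraction of observation pairs on which two partitions disagree.

    A pair (i, j) counts as a disagreement when one partition places i and j
    in the same cluster and the other places them in different clusters.
    Assumed definition: CER = 1 - Rand index. 0 means identical partitions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two observations are required")

    disagreements = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a != same_in_b:
            disagreements += 1

    # Normalize by the number of unordered pairs, n choose 2.
    return disagreements / (n * (n - 1) / 2)
```

Because only co-membership of pairs is compared, the measure does not depend on how clusters are labeled: `classification_error_rate([0, 0, 1, 1], [1, 1, 0, 0])` is 0 even though the label values are swapped.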


Summary

Introduction

The identification of groups in real-world high-dimensional datasets is challenging for several reasons: (1) the presence of outliers; (2) the presence of noise variables; (3) the selection of proper parameters for the clustering procedure, e.g. the number of clusters. In any large, complex, high-dimensional dataset, outliers and noise variables are very likely to appear. A clustering method needs to be designed so that both aspects are taken into account, regardless of whether outliers are regarded as highly interesting observations because of their typically different content or simply as noise. The data complexity, in terms of the number of groups, the proportion of outliers, and the number of noise variables, depends heavily on the dataset itself. A clustering procedure should therefore ideally be data-independent, that is, it should not require these quantities to be known in advance. The goal of this paper is to introduce a clustering method designed for such an application scenario.

