Abstract

Simultaneous feature selection and clustering is a major challenge in unsupervised learning. In particular, there has been significant research into saliency measures for features that result in good clustering. However, as datasets become larger and more complex, there is a need to adopt a finer-grained approach to saliency by measuring it in relation to a part of a model. Another issue is learning the feature saliency alongside the other model parameters. We address the first by presenting a novel Gaussian mixture model, which explicitly models the dependency of individual mixture components on each feature, giving a new component-based feature saliency measure. For the second, we use Markov Chain Monte Carlo sampling to estimate the model and hidden variables. Using a synthetic dataset, we demonstrate the superiority of our approach, in terms of clustering accuracy and model parameter estimation, over an approach using model-based feature saliency with expectation maximisation. We evaluated our approach on six synthetic trajectory datasets, obtaining an average clustering accuracy of 97 percent. To demonstrate the generality of our approach, we applied it to a network traffic flow dataset, obtaining an accuracy of 93 percent for intrusion detection. Finally, we compared it with state-of-the-art clustering techniques on three real-world trajectory datasets of vehicle traffic. Our approach achieved an average clustering accuracy of 96 percent, compared to 77-95 percent for the other techniques. In conclusion, for the datasets considered, component-based feature saliency measures gave improved clustering over those based on whole models.

Highlights

  • Clustering is one of the most fundamental approaches in data analysis

  • We address the first by presenting a novel Gaussian mixture model, which explicitly models the dependency of individual mixture components on each feature giving a new component-based feature saliency measure

  • We demonstrate the superiority of our approach, in terms of clustering accuracy and model parameter estimation, over an approach using a model-based feature saliency with expectation maximisation

Summary

Introduction

Clustering is one of the most fundamental approaches in data analysis. It discovers structure in data by organising it into homogeneous groups, where within-group object similarity is maximised and between-group object similarity is minimised [1]. The vast majority of early approaches were distance-based algorithms, in which some distance measure is defined to govern the partitioning task. Challenges facing this group include near-uniform distances in high-dimensional data [2], as well as the curse of dimensionality. Finite mixture models have been widely used to provide a formal framework for clustering, taking advantage of their natural capacity to represent heterogeneity. This latter group faces issues including the choice of statistical distribution, the algorithm for estimating the mixture's parameters, the number of clusters, and feature selection in high-dimensional problems [3].
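To make the mixture-model framework concrete, the following is a minimal sketch of expectation maximisation for a two-component one-dimensional Gaussian mixture. It is illustrative only: the model described in this paper additionally attaches feature-saliency variables to individual components and estimates everything with Markov Chain Monte Carlo sampling rather than EM, and the function name `gmm_em_1d` is our own.

```python
import math
import random

def gmm_em_1d(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances, responsibilities).
    Illustrative sketch only, not the paper's MCMC-based method."""
    # Initialise by splitting the sorted data at the median,
    # so the two components start in different regions.
    xs = sorted(data)
    mid = len(xs) // 2
    mu = [sum(xs[:mid]) / mid, sum(xs[mid:]) / (len(xs) - mid)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def normal_pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means and variances
        # from the responsibility-weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return w, mu, var, resp

# Usage: two well-separated clusters are recovered by their fitted means.
random.seed(1)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])
weights, means, variances, resp = gmm_em_1d(data)
```

The responsibilities computed in the E-step also give the soft cluster assignment for each point, which is the clustering output of such a model.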

