Weighted Cluster Ensemble Based on Partition Relevance Analysis With Reduction Step

Nejc Ilc

doi:10.1109/access.2020.3003046

Abstract

Over the last decade, the advent of the cluster ensemble framework has enabled more accurate and robust data analysis than traditional single clustering algorithms. The improved clustering of microarray data has had a particularly strong impact in the fields of genomics and medicine. However, when we bring several ensemble members together to form a consensus, low-quality data partitions can seriously compromise the final solution. One way to overcome this problem is the weighted cluster ensemble approach based on Partition Relevance Analysis (PRA), which uses internal cluster validity indices to evaluate and weight the ensemble members before the fusion. Unfortunately, the selection of appropriate validation indices for given data is far from trivial. In this paper, we propose an additional step in PRA that reduces the size of the committee of cluster validation indices. It does so by eliminating redundant and noisy indices using data dimensionality reduction methods. Our extension works in an unsupervised way, minimizing the amount of user intervention and required expert knowledge. We adapted three conventional consensus functions based on the principle of evidence accumulation to work with PRA weights. We demonstrate the advantages of the proposed reduction step of PRA based on extensive experiments with 25 gene expression and 15 non-genetic real-world datasets, where we compared 15 consensus functions. The source code is available at https://github.com/nejci/PRAr.
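As a rough illustration of what weighting ensemble members before fusion can look like, the sketch below builds a weighted co-association matrix in the spirit of evidence accumulation. It is not taken from the PRAr repository: the function name `weighted_coassociation`, the toy partitions, and the weight values are all hypothetical, and the weights are simply assumed to have been produced beforehand by some relevance analysis of the partitions.

```python
import numpy as np


def weighted_coassociation(partitions, weights):
    """Weighted co-association matrix for an ensemble of partitions.

    partitions: (M, N) integer array, one row of cluster labels per member.
    weights:    (M,) non-negative relevance weights (hypothetical values,
                assumed to come from a prior partition relevance analysis).
    """
    partitions = np.asarray(partitions)
    weights = np.asarray(weights, dtype=float)
    if weights.sum() <= 0:
        raise ValueError("weights must have a positive sum")

    n_points = partitions.shape[1]
    coassoc = np.zeros((n_points, n_points))
    for labels, w in zip(partitions, weights):
        # True where two points share a cluster in this partition.
        same_cluster = labels[:, None] == labels[None, :]
        coassoc += w * same_cluster
    # Normalize so that entries lie in [0, 1].
    return coassoc / weights.sum()


# Toy ensemble of three partitions over four data points (made-up data).
partitions = [
    [0, 0, 1, 1],  # member A
    [0, 0, 0, 1],  # member B
    [0, 1, 0, 1],  # member C, assumed to be of low relevance
]
weights = [0.5, 0.4, 0.1]
C = weighted_coassociation(partitions, weights)
```

In this toy case, points 0 and 1 share a cluster in the two highly weighted members, so their entry in `C` is high, while the low-weight member C has little influence on the result. A consensus partition could then be obtained by clustering `C` as a similarity matrix, for example with hierarchical clustering.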

Highlights

Cluster analysis is a fundamental tool in the fields of machine learning, pattern recognition, and data mining [1], where efficient analysis, visualization, and interpretation of data are essential, mainly because of the constant growth of data volume.
We propose an enhancement of Partition Relevance Analysis (PRA) with an additional reduction step (PRAr) that reduces the number of cluster validity indices (CVIs) while preserving the most informative ones, without user intervention and without requiring labeled data.
How do the consensus functions perform on the selected datasets? What are the probabilities that our proposed algorithms using PRAr are better than the others? Which PRAr configuration is, on average, the best for a particular data type and consensus function? What are the relations between the unification, reduction, and aggregation functions in terms of performance? To answer these questions, we defined two evaluation protocols for computing the performance score of a consensus function.

Summary

Introduction

Cluster analysis is a fundamental tool in the fields of machine learning, pattern recognition, and data mining [1], where efficient analysis, visualization, and interpretation of data are essential, mainly because of the constant growth of data volume. Clustering is the process of organizing data into natural groups, or clusters, such that similar data points are assigned to the same cluster [2]. Data clustering is an unsupervised learning task, meaning that the number of clusters is unknown and none of the input data points are labeled. Applications of clustering include image segmentation [3], text mining [4], gene expression analysis [5], air pollution analysis [6], and fault diagnosis [7], to name only a few.
