Abstract

Successful clustering algorithms are highly dependent on parameter settings: clustering performance degrades significantly unless the parameters are properly set, and yet it is difficult to set them a priori. To address this issue, in this paper we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which requires no a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge similar clusters, our framework can split and merge clusters automatically during the process and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented in two algorithms based on two distinct clustering paradigms: competitive learning and the finite mixture model. Within the proposed framework, many other algorithms can also be derived for different clustering paradigms. The minimum message length criterion is integrated into the framework for clustering selection. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets, and two real microarray gene expression datasets are further studied using this approach. As measured by multiple performance metrics, all numerical results show that SMART is superior to the compared self-splitting and traditional algorithms. Three main properties of the proposed SMART framework are: (1) it requires no dataset-dependent parameters or a priori knowledge about the datasets; (2) it is extensible to many different applications; and (3) it offers superior performance compared with counterpart algorithms.
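As a rough illustration only (not the authors' SMART algorithms), the splitting-while-merging idea with an information-theoretic selection criterion can be sketched as follows. The `penalized_score` function is a BIC-style stand-in for the minimum message length criterion, and the function names and the variance-based split rule are our own assumptions:

```python
import numpy as np

def penalized_score(X, labels):
    # BIC-style score (lower is better); a simple stand-in for the
    # minimum message length criterion used in the paper.
    n, d = X.shape
    ks, counts = np.unique(labels, return_counts=True)
    w = sum(((X[labels == k] - X[labels == k].mean(0)) ** 2).sum() for k in ks)
    ll = 0.5 * n * d * np.log(w / (n * d) + 1e-12)  # spherical-Gaussian fit
    mix = -(counts * np.log(counts / n)).sum()      # mixing proportions
    penalty = 0.5 * len(ks) * (d + 1) * np.log(n)   # parameter-count penalty
    return ll + mix + penalty

def split_merge_cluster(X, max_rounds=10):
    # Start from a single cluster (K = 1); greedily try splitting each
    # cluster along its highest-variance axis and merging each cluster
    # pair, keeping any move that lowers the penalized score.
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_rounds):
        improved = False
        for k in list(np.unique(labels)):           # trial splits
            idx = np.where(labels == k)[0]
            if len(idx) < 4:
                continue
            axis = X[idx].var(0).argmax()
            trial = labels.copy()
            trial[idx[X[idx, axis] > X[idx, axis].mean()]] = labels.max() + 1
            if penalized_score(X, trial) < penalized_score(X, labels):
                labels, improved = trial, True
        ks = list(np.unique(labels))
        for i in range(len(ks)):                    # trial merges
            for j in range(i + 1, len(ks)):
                trial = np.where(labels == ks[j], ks[i], labels)
                if penalized_score(X, trial) < penalized_score(X, labels):
                    labels, improved = trial, True
        if not improved:
            break
    return labels

# Three well-separated 2-D Gaussian clusters: the loop grows K from 1
# and settles on three clusters without being told the number.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0.0, 4.0, 8.0)])
print(len(np.unique(split_merge_cluster(X))))  # 3
```

The split and merge moves are accepted only when they lower the penalized score, so the same loop both grows and shrinks K, loosely mirroring the splitting-while-merging behavior described above.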

Highlights

  • Clustering methods have been widely used in many fields, including biology, physics, computer science, communications, artificial intelligence, image processing, and medical research, requiring analysis of large quantities of data to explore the relationships between individual objects within the respective datasets [1,2,3,4,5,6,7,8,9,10,11,12,13].

  • The “splitting merging awareness tactics” (SMART) framework: first of all, we must emphasize that SMART is a framework rather than a simple clustering algorithm, within which a number of clustering techniques are organically integrated.

  • SMART starts with one cluster (K = 1, where K is the number of clusters), and this cluster needs to be initialized, which is Task 1.


Introduction

Clustering methods have been widely used in many fields, including biology, physics, computer science, communications, artificial intelligence, image processing, and medical research, requiring analysis of large quantities of data to explore the relationships between individual objects within the respective datasets [1,2,3,4,5,6,7,8,9,10,11,12,13]. There are many families of clustering algorithms used in gene expression analysis, including partitional clustering, hierarchical clustering, model-based clustering, and self-organizing clustering [3,23]. The results of most successful clustering algorithms strongly depend on the determined number of clusters, e.g. k-means, model-based clustering, and hierarchical clustering (when the clustering memberships need to be determined). The problem of determining the best number of clusters is addressed by another branch of research in clustering analysis, known as clustering validation [26,27,28]. Among various clustering validation criteria, clustering validity indices, known as relative criteria, have been employed to quantitatively evaluate the goodness of a clustering result and estimate the best number of clusters. There are two main classes of validity indices: a) model-based or information-theoretic validation, e.g. minimum description length (MDL) [29], minimum message length (MML) [30,31], Bayesian information criterion (BIC) [32], Akaike’s information criterion (AIC) [33], and the normalized entropy criterion (NEC) [34]; b) geometric-based validation, which considers the ratio of within-group distance to between-group distance (or its reciprocal), such as the Calinski-Harabasz (CH) index [35], Dunn’s index (DI) [36], the Davies-Bouldin (DB) index [37], the I index [38], the Silhouette index (SI) [39], the geometrical index (GI) [40], the validity index VI [41], and the parametric validity index (PVI) [42,43].
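To make the geometric family of validity indices concrete, here is a minimal sketch (our own illustration from the standard definition, not code from any of the cited works) of the Calinski-Harabasz index: the ratio of between-group to within-group dispersion, each normalized by its degrees of freedom:

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: between-group dispersion over
    within-group dispersion, normalized by degrees of freedom.
    Higher values indicate compact, well-separated clusters."""
    X = np.asarray(X, dtype=float)
    ks = np.unique(labels)
    n, K = len(X), len(ks)
    grand_mean = X.mean(axis=0)
    between = within = 0.0
    for k in ks:
        members = X[labels == k]
        centroid = members.mean(axis=0)
        between += len(members) * ((centroid - grand_mean) ** 2).sum()
        within += ((members - centroid) ** 2).sum()
    return (between / (K - 1)) / (within / (n - K))

# Two well-separated clusters score far higher under the true
# labeling than under a random one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
true_labels = np.repeat([0, 1], 50)
random_labels = rng.integers(0, 2, 100)
print(calinski_harabasz(X, true_labels) > calinski_harabasz(X, random_labels))  # True
```

Estimating the best number of clusters with such an index amounts to computing it over a range of candidate K values and picking the maximizer, which is exactly the validation step that SMART folds into its splitting and merging decisions via MML instead.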

