Abstract

This paper focuses on the stability-based approach to estimating the number of clusters K in microarray data. The cluster stability approach consists of clustering successive random subsets of the available data and evaluating an index that expresses the similarity of the resulting partitions. We present a method for automatically estimating K from the distribution of the similarity index. We investigate how the choice of the hierarchical clustering (HC) method and of the similarity index influences the estimation accuracy. The paper introduces a new similarity index based on a partition distance. The performance of the new index and that of other well-known indices are experimentally evaluated by comparing the true data partition with the partition obtained at each level of an HC tree. A case study is conducted with a publicly available Leukemia dataset.

Highlights

  • Clustering algorithms are frequently used to analyze microarray data

  • In order to illustrate the challenge of structure estimation for microarray data, we consider the leukemia dataset described in [13], publicly available at http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi, which comes from a study of gene expression in two types of acute leukemias, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML)

  • The algorithm successfully identifies the lack of structure in Model 1 (100% correct estimation); for the other five structured models, the percentage of correct estimation exceeds 70%, which recommends the use of sw for a wide family of input data distributions, even when some variables are noisy


Summary

INTRODUCTION

Clustering algorithms are frequently used to analyze microarray data. While various clustering methods help the bioinformatics practitioner ascertain different characteristics of the structural organization of microarray datasets, selecting the most appropriate algorithm for a particular problem is nontrivial. The similarity index is computed for the samples contained in the selected subset. In both approaches, it is assumed that the number of clusters is k ∈ {2, 3, ...}. We restrict our investigation to agglomerative hierarchical clustering (HC) algorithms [10], mainly because this class of clustering methods is very popular in microarray data analysis [11]. These algorithms are computationally efficient since the same tree can be used for all values of k ∈ {2, 3, ...}. Comparisons with other methods are reported for simulated data, and a case study is conducted on the Leukemia dataset [13].
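The stability idea summarized above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it uses a naive average-linkage HC, the classical Rand index as a stand-in for the similarity indices studied in the paper, and synthetic two-dimensional data. The names hc_levels, rand_index and stability, the subset fraction 0.8 and the number of subset pairs are all hypothetical choices made for the example.

```python
import random
from itertools import combinations


def avg_dist(points, c1, c2):
    """Average Euclidean distance between the members of two clusters."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
        for i in c1 for j in c2
    )
    return total / (len(c1) * len(c2))


def hc_levels(points):
    """Naive average-linkage HC; returns {k: labels} for every level of ONE tree."""
    clusters = [[i] for i in range(len(points))]
    levels = {}
    while True:
        labels = [0] * len(points)
        for c, members in enumerate(clusters):
            for i in members:
                labels[i] = c
        levels[len(clusters)] = labels
        if len(clusters) == 1:
            return levels
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: avg_dist(points, clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two partitions agree (Rand index)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def stability(points, ks, n_pairs=10, frac=0.8, seed=0):
    """Mean similarity, per k, between partitions of two random subsets."""
    rng = random.Random(seed)
    n, m = len(points), int(frac * len(points))
    scores = {k: [] for k in ks}
    for _ in range(n_pairs):
        sub1 = rng.sample(range(n), m)
        sub2 = rng.sample(range(n), m)
        # One tree per subset serves every candidate k.
        lev1 = hc_levels([points[i] for i in sub1])
        lev2 = hc_levels([points[i] for i in sub2])
        pos1 = {s: t for t, s in enumerate(sub1)}
        pos2 = {s: t for t, s in enumerate(sub2)}
        shared = sorted(set(sub1) & set(sub2))
        for k in ks:
            # Compare the two partitions on the samples both subsets contain.
            la = [lev1[k][pos1[s]] for s in shared]
            lb = [lev2[k][pos2[s]] for s in shared]
            scores[k].append(rand_index(la, lb))
    return {k: sum(v) / len(v) for k, v in scores.items()}


# Synthetic data (assumption): three well-separated groups of 10 points each.
rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
        for cx, cy in centers for _ in range(10)]
scores = stability(data, ks=[2, 3, 4, 5])
for k, s in scores.items():
    print(k, round(s, 3))
```

With three well-separated groups, the partitions obtained at k = 3 agree across subsets, whereas splitting a genuine group (k = 4 or 5) tends to be less reproducible. How K is selected from the distribution of the index is the subject of the paper and is not reproduced in this sketch.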

MOTIVATION OF THE WORK
SIMILARITY MEASURES
A similarity index defined as complement of a partition distance
Similarity indices “corrected for chance”
STABILITY-BASED METHOD FOR ESTIMATING THE NUMBER OF CLUSTERS
Performance evaluation with simulated data
Clustering the Leukemia dataset
CONCLUSION
PROOFS OF PROPOSITIONS
Findings
ASYMPTOTIC AND FINITE SAMPLE CHARACTERISTICS FOR THE SIMILARITY INDICES
