Abstract

Background: The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the well-known Euclidean distance and Pearson correlation coefficient.

Results: Relying on several public gene expression datasets, we evaluate the homogeneity and separation scores of different clustering solutions. It was found that the use of the MI measure yields a more significant differentiation among erroneous clustering solutions. The proposed measure was also used to analyze the performance of several known clustering algorithms. A comparative study of these algorithms reveals that their "best solutions" are ranked almost oppositely when using different distance measures, despite the correspondence found between these measures when analyzing the averaged scores of groups of solutions.

Conclusion: In view of the results, further attention should be paid to the selection of a proper distance measure for analyzing the clustering of gene expression data.

Highlights

  • The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles

  • The results show that the sIB algorithm [32,33], which was originally based on a mutual-information criterion, obtains better Mutual Information (MI)-based homogeneity and separation scores than those provided by the K-means, CLICK, and SOM algorithms [5,21]

  • In the first experiment, which is based on known clustering solutions, we show the statistical superiority of the average MI-based measure independently of the selected clustering algorithm
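The homogeneity and separation scores mentioned in the highlights can be illustrated with a minimal sketch. It assumes one common pairwise formulation (homogeneity as the mean measure over pairs in the same cluster, separation as the mean over pairs in different clusters); the exact definitions used in the study are not given in this excerpt. The function name `homogeneity_and_separation` and the toy profiles are hypothetical, and Euclidean distance is used only as a stand-in for any pairwise measure.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def homogeneity_and_separation(profiles, labels, measure):
    """Mean pairwise measure within clusters and between clusters.

    Assumption: a pairwise formulation. The solution must contain at least
    one same-cluster pair and one cross-cluster pair, otherwise a mean
    would be taken over an empty list.
    """
    within, between = [], []
    for i, j in combinations(range(len(profiles)), 2):
        value = measure(profiles[i], profiles[j])
        if labels[i] == labels[j]:
            within.append(value)
        else:
            between.append(value)
    return sum(within) / len(within), sum(between) / len(between)


# Toy data: two tight groups that lie far apart from each other.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good_labels = [0, 0, 1, 1]  # matches the obvious grouping
bad_labels = [0, 1, 0, 1]   # deliberately erroneous solution

h_good, s_good = homogeneity_and_separation(profiles, good_labels, euclidean)
h_bad, s_bad = homogeneity_and_separation(profiles, bad_labels, euclidean)
```

With a distance as the measure, a good solution has a small homogeneity value and a large separation value; with a similarity such as MI or Pearson correlation, the preferred directions are reversed.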


Summary

Introduction

The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the well-known Euclidean distance and Pearson correlation coefficient. Clustering is a central method for analyzing gene expression data and has been implemented extensively in various works and applications [1,2,3,4,5]. The primary goal is to cluster together genes or tissues that manifest similar expression patterns [1]. Similar expression patterns might offer insights into various transcriptional and biological processes [6,7,8].
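To make the three measures concrete, the following is a minimal sketch that computes each of them for a pair of expression profiles. The MI estimate here uses equal-width binning with a hypothetical `n_bins` parameter; this is an assumption for illustration, since the excerpt does not state how the study estimates MI. All function names are illustrative.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient (assumes neither profile is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def discretize(x, n_bins=3):
    """Map values to equal-width bin indices 0..n_bins-1 (assumed scheme)."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0] * len(x)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in x]


def mutual_information(x, y, n_bins=3):
    """Plug-in MI estimate in bits from the binned joint distribution."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    joint, count_x, count_y = Counter(zip(bx, by)), Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i,j) / (p(i) * p(j)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (count_x[i] * count_y[j]))
    return mi


# Two perfectly anti-correlated toy profiles.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

d = euclidean_distance(x, y)
r = pearson_correlation(x, y)
mi = mutual_information(x, y)
```

For these toy profiles the Euclidean distance is large and the Pearson coefficient is negative, while the binned MI estimate is high because each profile fully determines the bin of the other. This is only meant to show that the three measures can rank the same pair of profiles very differently.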

