Stepwise iterative maximum likelihood clustering approach.

Alok Sharma, Daichi Shigemizu, Yosvany López, Keith A Boroevich, Michiaki Kubo, Yoichiro Kamatani, Tatsuhiko Tsunoda

doi:10.1186/s12859-016-1184-5

Abstract

Background: Biological/genetic data is a complex mix of various forms or topologies, which makes it quite difficult to analyze. The abundance of such data requires sophisticated statistical methods that can analyze it in a reasonable amount of time. Many biological/genetic analyses, such as genome-wide association study (GWAS) analysis or multi-omics data analysis, need to cluster large volumes of data into sub-categories in order to understand the subtypes of populations, cancers or other diseases. Traditionally, k-means has been the dominant clustering algorithm because of its simplicity and reasonable level of accuracy. Many other clustering methods, including support vector clustering, have been developed, but they do not perform well on biological data, either for computational reasons or because they fail to identify clusters.

Results: The proposed SIML clustering algorithm has been tested on microarray datasets and SNP datasets and compared with a number of clustering algorithms. On the MLL datasets, SIML achieved the highest clustering accuracy and Rand score in 4 of 9 cases; on the SRBCT dataset, it did so in 3 of 5 cases; on the ALL subtype dataset, it achieved the highest clustering accuracy in 5 of 7 cases and the highest Rand score in 4 of 7 cases. In addition, on a three-cluster problem using SNP data, SIML's clustering accuracy was 97.3 %, 94.7 % and 100 % for the three clusters, respectively.

Conclusions: In this paper, considering the nature of biological data, we proposed a maximum likelihood clustering approach using a stepwise iterative procedure. The advantage of the proposed method is that it uses not only distance information but also variance information for clustering. The method is able to cluster data that appear in overlapping and complex forms. The experimental results illustrate its performance and usefulness compared with other clustering methods. A Matlab package of this method (SIML) is provided at the web-link http://www.riken.jp/en/research/labs/ims/med_sci_math/.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1184-5) contains supplementary material, which is available to authorized users.

Highlights

Biological/genetic data is a complex mix of various forms or topologies which makes it quite difficult to analyze
Though the k-means clustering algorithm has been extensively applied [4] due to its simplicity and reasonable level of accuracy, it cannot track clusters when samples of different groups overlap one another
In subsection 4, we discuss the performance of various methods in terms of clustering accuracy and Rand score; and, in subsection 5, we discuss stepwise iterative maximum likelihood (SIML) on biological data
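
For readers unfamiliar with the Rand score mentioned above, the following is a minimal sketch of how such a score can be computed. It assumes the plain (unadjusted) Rand index, i.e. the fraction of sample pairs on which two labelings agree; this page does not state which variant the paper uses, and the function name `rand_index` is a hypothetical helper, not part of the SIML package.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Unadjusted Rand index: fraction of sample pairs on which the two
    labelings agree (both put the pair together, or both keep it apart).
    Assumption: this is the plain Rand index, not the adjusted variant."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    # Cluster names do not matter, only which samples are grouped together.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because the index is computed over pairs, it is unaffected by how the clusters happen to be numbered, which is convenient when comparing a clustering result with known class labels.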

Summary

Introduction

Biological/genetic data is a complex mix of various forms or topologies, which makes it quite difficult to analyze. Though the k-means clustering algorithm has been extensively applied [4] due to its simplicity and reasonable level of accuracy, it cannot track clusters when samples of different groups overlap one another (i.e., data points of adjacent groups are spread in such a way that the groups partly coincide). This is sometimes the case in biological data, and it can lead to inaccurate clusters.
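
The difference between using distance alone and also using variance information can be illustrated with a small sketch. This is not the SIML algorithm: it assumes two Gaussian clusters whose means and covariances are already known, and all numbers and function names below are hypothetical, chosen only to show that a sample near the boundary between a tight cluster and a widely spread one can be assigned differently by the two criteria.

```python
import numpy as np


def nearest_centroid(x, means):
    """Distance-only assignment (the rule k-means uses for a fixed set of centroids)."""
    distances = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(distances))


def gaussian_log_likelihood(x, mean, cov):
    """Log density of a multivariate Gaussian with the given mean and covariance."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (mahalanobis_sq + logdet + len(x) * np.log(2.0 * np.pi))


def max_likelihood_cluster(x, means, covs):
    """Assign x to the cluster under which it is most likely (uses variance too)."""
    scores = [gaussian_log_likelihood(x, m, c) for m, c in zip(means, covs)]
    return int(np.argmax(scores))


# Illustrative values (assumptions): a tight cluster at the origin and a
# widely spread cluster centred at (6, 0) that overlaps it.
means = np.array([[0.0, 0.0], [6.0, 0.0]])
covs = [np.eye(2), 9.0 * np.eye(2)]
x = np.array([2.9, 0.0])

if __name__ == "__main__":
    print("nearest centroid:", nearest_centroid(x, means))
    print("maximum likelihood:", max_likelihood_cluster(x, means, covs))
```

In this example the sample is slightly closer to the tight cluster's centre, yet it lies almost three standard deviations away from it and only about one standard deviation from the centre of the wide cluster, so the likelihood criterion assigns it to the wide cluster.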

Methods

Results

Conclusion

Journal: BMC Bioinformatics	Publication Date: Aug 24, 2016
Citations: 66	License type: CC BY 4.0
