Abstract

In this paper, we develop a clustering approach for biological data. In many biological analyses, such as multiomics data analysis and genome-wide association studies, it is crucial to find groups of data belonging to subtypes of diseases or tumors. Conventionally, the k-means clustering algorithm is the most widely applied method in many areas, including the biological sciences. There are, however, several alternative clustering algorithms that can be applied, including support vector clustering. Taking into consideration the nature of biological data, we propose a maximum likelihood clustering scheme based on a hierarchical framework. This method can perform clustering even when the data belonging to different groups overlap, and when the number of samples is lower than the data dimensionality. The proposed scheme does not require initial settings to be selected before the search process begins. In addition, it does not require the computation of the first and second derivatives of likelihood functions, as many other maximum likelihood-based methods do. The algorithm uses distribution and centroid information to cluster samples, and we applied it to biological data. A MATLAB implementation of this method can be downloaded from http://www.riken.jp/en/research/labs/ims/med_sci_math/.
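To illustrate why distribution information can matter in addition to centroid information when groups overlap, the following is a minimal Python sketch. It is not the proposed hierarchical maximum likelihood method and not the MATLAB implementation; it only contrasts nearest-centroid assignment with assignment by Gaussian log-likelihood. The function names, the two cluster parameters, and the query point are all hypothetical, and the Gaussian model is an assumption made for the example.

```python
import numpy as np


def gaussian_log_likelihood(x, mean, cov):
    """Log-density of point x under a multivariate normal N(mean, cov)."""
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)            # log|cov|, numerically stable
    maha = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def assign_by_centroid(x, means):
    """Assign x to the cluster whose centroid is nearest (Euclidean distance)."""
    distances = [np.linalg.norm(x - m) for m in means]
    return int(np.argmin(distances))


def assign_by_likelihood(x, means, covs):
    """Assign x to the cluster under which it has the highest log-likelihood."""
    scores = [gaussian_log_likelihood(x, m, c) for m, c in zip(means, covs)]
    return int(np.argmax(scores))


# Two hypothetical overlapping clusters (illustrative values only):
# cluster 0 is tight, cluster 1 is broad and overlaps cluster 0.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.eye(2) * 0.25, np.eye(2) * 9.0]

# A point slightly closer to centroid 0, but far out in its tail.
x = np.array([1.45, 0.0])

centroid_label = assign_by_centroid(x, means)
likelihood_label = assign_by_likelihood(x, means, covs)
print("centroid-based label:  ", centroid_label)
print("likelihood-based label:", likelihood_label)
```

In this toy setting the two rules disagree: the point is nearer to the centroid of the tight cluster, yet it is more plausible under the broad cluster once the spread of each group is taken into account.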

Highlights

  • The aim of unsupervised clustering algorithms is to partition the data into clusters

  • We analyze artificial data as well as biological data to evaluate the performance of hierarchical maximum likelihood (HML)

  • We propose a hierarchical maximum likelihood (HML) method by considering the topologies of genomic data

Introduction

The aim of unsupervised clustering algorithms is to partition the data into clusters. In this setting, the class label information is unknown; i.e., knowledge regarding the state of nature of the samples is not provided, and clustering is performed by taking into account a similarity or distance measure, distribution information, or some objective function. In biological data (e.g., genomic data, transcriptomic data), the number of clusters, as well as their locations, is unknown. It would be beneficial to develop a scheme that takes into account the distribution information as well

