Abstract

Availability of diverse types of high-throughput data increases the opportunities for researchers to develop computational methods to provide a more comprehensive view for the mechanism and therapy of cancer. One fundamental goal for oncology is to divide patients into subtypes with clinical and biological significance. Cluster ensemble fits this task exactly. It can improve the performance and robustness of clustering results by combining multiple basic clustering results. However, many existing cluster ensemble methods use a co-association matrix to summarize the co-occurrence statistics of the instance-cluster, where the relationship in the integration is only encapsulated at a rough level. Moreover, the relationship among clusters is completely ignored. Finding these missing associations could greatly expand the ability of cluster ensemble methods for cancer subtyping. In this paper, we propose the RWCE (Random Walk based Cluster Ensemble) to consider similarity among clusters. We first obtained a refined similarity between clusters by using random walk and a scaled exponential similarity kernel. Then, after being modeled as a bipartite graph, a more informative instance-cluster association matrix filled with the aforementioned cluster similarity was fed into a spectral clustering algorithm to get the final clustering result. We applied our method on six cancer types from The Cancer Genome Atlas (TCGA) and breast cancer from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). Experimental results show that our method is competitive against existing methods. Further case study demonstrates that our method has the potential to find subtypes with clinical and biological significance.

Highlights

  • With the efforts of the large-scale projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) [1,2,3], a wealth of genome-scale molecular data are available and easy to access

  • In order to show the effectiveness of our method, we used six TCGA cancer types: kidney renal clear cell carcinoma (KIRC), glioblastoma multiforme (GBM), lung squamous cell carcinoma (LUSC), breast invasive carcinoma (BRCA), acute myeloid leukemia (LAML), and colon adenocarcinoma (COAD)

  • A novel Refined Instance-Cluster Association (RIC) matrix is used in RWCE that considers relationships among clusters, which contributes to a superior clustering

Read more

Summary

Introduction

With the efforts of the large-scale projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) [1,2,3], a wealth of genome-scale molecular data are available and easy to access. The multiple types of omics data from genomes, transcriptomes, proteome, and epigenomes enable researchers to embrace great opportunities and possibilities to explore a more comprehensive view into cancer informatics, such as drug target prediction [4,5], diver gene identification [6,7,8], and so on. Applying traditional clustering algorithms on a single data type—like gene expression data—does not obtain satisfactory results such as deriving subtypes. Genes 2019, 10, 66 associated with clinical phenotype [9]. These unsatisfactory results indicate the limitations of expression-based analysis for cancer subtyping. Since different types of molecular data contain information in various aspects that may complement each other, it is beneficial for leveraging different

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call