Abstract

Copy number variation (CNV) may contribute to the development of complex diseases. However, because the mechanisms underlying path associations are complex and sample sizes are limited, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene, and disease label data provides an opportunity to design a new machine learning framework for predicting potential disease-related CNVs. In this paper, we developed a novel machine learning approach, IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict CNV–disease path associations from a data set containing CNV, disease state label, and gene data. CNVs, genes, and diseases are connected by edges to form a biological association network. To construct this network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. We then used logistic regression with an L1 penalty (LLR) to detect disease-related genes. Both the self-adaptive BM and LLR steps were run under a stability selection strategy, which can effectively reduce false positives. Finally, a weighted path search algorithm was applied to find the top D path associations and important CNVs. Experimental results on both simulated and prostate cancer data show that IHI-BMLLR performs significantly better than two state-of-the-art CNV detection methods (CCRET and DPtest) under false-positive control. Furthermore, applying IHI-BMLLR to prostate cancer data revealed significant path associations. Three new cancer-related genes were discovered in these paths; they remain to be verified by future biological research.
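To make the first step concrete, the following is a minimal sketch of the standard biweight midcorrelation between one CNV profile and one gene profile. It is not the self-adaptive BM variant used by IHI-BMLLR, whose details are not given in the abstract. The function names `bicor` and `_biweight_terms`, the conventional tuning constant of 9, and the toy data are assumptions made for illustration only.

```python
from math import sqrt
from statistics import median


def _biweight_terms(values):
    """Median-centered deviations, down-weighted by the biweight function.

    Points further than 9 MADs from the median receive weight 0.
    (The constant 9 is the conventional choice, assumed here.)
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; biweight midcorrelation is undefined")
    terms = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    return terms


def bicor(x, y):
    """Standard biweight midcorrelation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = _biweight_terms(x)
    b = _biweight_terms(y)
    norm = sqrt(sum(t * t for t in a) * sum(t * t for t in b))
    return sum(s * t for s, t in zip(a, b)) / norm


# Hypothetical toy data: ten samples, with one extreme outlier in the gene profile.
cnv_profile = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gene_profile = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(bicor(cnv_profile, gene_profile))
```

Because the outlier receives zero weight, it cannot dominate the coefficient the way it would with a Pearson correlation, which is the usual motivation for using a median-based measure on noisy genomic data.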

Highlights

  • Copy number variations (CNVs) contribute to a substantial fraction of human genetic variation and are increasingly involved in disease associations and genome evolution (Lupski, 2015)

  • We introduce methods within a machine learning framework for finding CNV–disease path associations

  • The results show that IHI-BMLLR performs clearly better than the compared methods (CCRET and DPtest)

Summary

Introduction

Copy number variations (CNVs) contribute to a substantial fraction of human genetic variation and are increasingly implicated in disease associations and genome evolution (Lupski, 2015). Much evidence points to a causal relationship between CNVs and many human disease phenotypes, including scores of known genomic diseases and hundreds of complex disease traits. However, known CNV–disease associations cover only a small fraction of CNVs and diseases. Computational models and methods have been developed to predict potential CNV–disease associations, which can serve as candidates for biological experimental verification. Such models and methods can greatly reduce the cost and time of experiments aimed at finding new CNV–disease associations.

