Abstract

The emergence of high-throughput RNA-seq data has offered unprecedented opportunities for cancer diagnosis. However, most existing approaches to cancer diagnosis struggle to capture the highly nonlinear and complex associations in biological data. In this study, we propose a novel hierarchical feature selection and second learning probability error ensemble model (named HFS-SLPEE) for precision cancer diagnosis. Specifically, we first integrated protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data to provide rich information. We then designed a novel hierarchical feature selection method, which takes CpG-gene biological associations into account and selects a compact set of superior features. Next, we used four individual classifiers with significant differences and apparent complementarity to build the heterogeneous classifiers. Lastly, we developed a second learning probability error ensemble model, called SLPEE, that learns from new data consisting of the classifier-predicted class probability values and the actual labels, thereby self-correcting diagnosis errors. Benchmarking comparisons on TCGA showed that HFS-SLPEE performs better than state-of-the-art approaches. Moreover, we analyzed 10 groups of selected features in depth and found several novel HFS-SLPEE-predicted epigenomics and epigenetics biomarkers for breast invasive carcinoma (BRCA) (e.g., TSLP and ADAMTS9-AS2), lung adenocarcinoma (LUAD) (e.g., HBA1 and CTB-43E15.1), and kidney renal clear cell carcinoma (KIRC) (e.g., IRX2 and BMPR1B-AS1).
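
The "second learning" idea in the abstract, training a further model on the class probabilities predicted by the base classifiers together with the actual labels, resembles generic stacking. The sketch below illustrates only that general idea and is not the authors' SLPEE. The four base classifiers are replaced by simulated probability outputs, and the signal strengths, noise level, sample sizes, and logistic-regression meta-learner are all assumptions made for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical signal strengths of four base classifiers (illustrative values only).
STRENGTHS = [0.25, 0.20, 0.15, 0.10]

def simulate_base_probabilities(label):
    """Stand-in for the class-1 probabilities four base classifiers would output."""
    direction = 1 if label == 1 else -1
    return [min(1.0, max(0.0, 0.5 + direction * s + random.gauss(0, 0.2)))
            for s in STRENGTHS]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_meta_learner(meta_X, y, lr=1.0, epochs=300):
    """Logistic regression fitted by batch gradient descent on centred probabilities."""
    w, b, n = [0.0] * len(meta_X[0]), 0.0, len(meta_X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for probs, label in zip(meta_X, y):
            x = [p - 0.5 for p in probs]
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - label
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, probs):
    """Second-level probability of class 1 given the base classifiers' probabilities."""
    return sigmoid(sum(wj * (p - 0.5) for wj, p in zip(w, probs)) + b)

# Simulated "new data": base-classifier probabilities paired with the actual labels.
labels = [random.randint(0, 1) for _ in range(400)]
meta_features = [simulate_base_probabilities(t) for t in labels]

# First 300 samples train the second-level learner; the rest are held out.
w, b = fit_meta_learner(meta_features[:300], labels[:300])
held_out = list(zip(meta_features[300:], labels[300:]))
accuracy = sum((predict_proba(w, b, p) >= 0.5) == (t == 1)
               for p, t in held_out) / len(held_out)
print(f"held-out accuracy of the second-level learner: {accuracy:.2f}")
```

In a real pipeline the second-level training probabilities would normally be out-of-fold predictions from fitted base classifiers, so that the second-level learner does not see outputs produced on the base classifiers' own training samples.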

Highlights

  • Cancer is characterized by concealed onset, a low cure rate, and high mortality

  • To verify whether HFS-SLPEE generalizes to the diagnosis of different cancers, we studied three high-incidence cancers: breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), and kidney renal clear cell carcinoma (KIRC)

  • The number of features selected for BRCA, LUAD, and KIRC was n = 21, n = 12, and n = 16, respectively

Introduction

Cancer is characterized by concealed onset, a low cure rate, and high mortality. Numerous studies have used epigenetic data such as microRNA (miRNA) (Saha et al, 2015), long non-coding RNA (lncRNA) expression profiles (Zhang et al, 2018), and DNA methylation (Al-Juniad et al, 2018) for cancer diagnosis and subtype classification, with some success (Raweh et al, 2018; Tang et al, 2018). Classical genetics and epigenetics are two separate mechanisms participating in carcinogenesis (Network, 2012). Epigenetic data such as ncRNA and DNA methylation are not independent of each other and often have synergistic effects (Xu et al, 2018). Using only protein-coding gene expression profiles and/or ncRNA expression profiles, or only DNA methylation data, discards information and limits how far the performance and robustness of cancer diagnosis can be improved.
