Improving Model Performance on the Stratification of Breast Cancer Patients by Integrating Multiscale Genomic Features.

Runyu Jing, Yingyi Hao, Zhining Wen, Li He, Yifan Zhou, Menglong Li, Yiru Zhao

doi:10.1155/2020/1475368

Abstract

In clinical cancer research, accurately stratifying patients based on genomic data is an active topic. With the development of next-generation sequencing technology, a growing number of genomic feature types, such as mRNA expression level, can be used to distinguish cancer patients. Previous studies commonly stratified patients using a single type of genomic feature, which reflects only one aspect of the cancer. Multiscale genomic features provide more information and may be helpful for clinical prediction. In addition, most conventional machine learning algorithms construct models from a handcrafted gene set, which is generally selected by a statistical method with an arbitrary cut-off, e.g., p value < 0.05. The genes in such a set are not necessarily related to the cancer and can make the model unreliable. Therefore, in our study, we thoroughly investigated the performance of different machine learning methods in stratifying breast cancer patients with a single type of genomic feature. We then proposed a strategy that takes into account the degree of correlation between genes and cancer patients to identify features from mRNAs and microRNAs, and evaluated the performance of models built on the resulting combined multiscale genomic features. The results showed that, compared with models constructed with a single type of feature, the models with the multiscale genomic features generated by our proposed method achieved better performance in stratifying the ER status of breast cancer patients. Moreover, gene set enrichment analysis showed that the identified multiscale genomic features were closely related to the cancer, indicating that our proposed strategy reflects the biological relevance of the genes to breast cancer.
In conclusion, modelling with multiscale genomic features closely related to the cancer can not only guarantee the prediction performance of the models but also effectively provide candidate genes for interpreting the mechanisms of cancer.

Highlights

Compared with the microarray technology, next-generation sequencing technology including DNA sequencing [1, 2] and RNA sequencing [3, 4] provides multiscale genomic features, such as mRNA expression [5, 6], microRNA expression [7, 8], and gene structure variation [9, 10], to characterize cancers in different aspects at the molecular level.
We evaluated the performance of the models, which were separately constructed by using the expression levels of the top n mRNAs and microRNAs identified by the five feature selection methods.
To obtain interpretable features, we proposed using the Shapley additive explanation (SHAP) method to identify the genomic features that were closely related to the cancer.
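
As a rough illustration of SHAP-based feature ranking, the sketch below scores features by their mean absolute Shapley value, keeps the top n from each feature type, and concatenates the two lists. It is not the authors' implementation. It assumes a linear model with independent features, for which the Shapley value of feature j on a sample x has the closed form w_j * (x_j - mean_j); the model actually used in the study is not stated in this section. The function names, feature names, coefficients, and expression values are all hypothetical.

```python
from statistics import mean


def linear_shap(weights, X):
    """Shapley values for a linear model f(x) = b + sum_j w_j * x_j.

    Assumption: features are independent, so the value for feature j on a
    sample is w_j * (x_j - mean_j), with mean_j taken over the rows of X.
    """
    means = [mean(col) for col in zip(*X)]
    return [[w * (x - m) for w, x, m in zip(weights, row, means)] for row in X]


def top_n_by_mean_abs_shap(names, shap_values, n):
    """Rank features by mean absolute Shapley value and keep the top n."""
    importance = {
        name: mean(abs(v) for v in col)
        for name, col in zip(names, zip(*shap_values))
    }
    return sorted(importance, key=importance.get, reverse=True)[:n]


# Toy data (hypothetical): 4 patients, 3 mRNA features, 2 microRNA features.
mrna_names = ["mRNA_A", "mRNA_B", "mRNA_C"]
mrna_X = [[1.0, 5.0, 2.0], [2.0, 5.1, 0.0], [3.0, 4.9, 2.0], [4.0, 5.0, 0.0]]
mrna_w = [0.8, 0.1, -0.5]  # hypothetical fitted coefficients

mirna_names = ["miR_X", "miR_Y"]
mirna_X = [[0.0, 1.0], [1.0, 1.0], [0.0, 1.2], [1.0, 0.8]]
mirna_w = [1.5, 0.2]  # hypothetical fitted coefficients

# Select the top features per type, then combine them into one feature list.
top_mrna = top_n_by_mean_abs_shap(mrna_names, linear_shap(mrna_w, mrna_X), n=2)
top_mirna = top_n_by_mean_abs_shap(mirna_names, linear_shap(mirna_w, mirna_X), n=1)
combined_features = top_mrna + top_mirna
print(combined_features)
```

In this toy example, features that barely vary across patients (mRNA_B, miR_Y) receive small Shapley values and are dropped, while features with large weighted deviations from their mean are kept. For non-linear models the closed form above no longer holds and Shapley values must be estimated, for example with a dedicated SHAP library.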

Summary

Introduction

Compared with the microarray technology, next-generation sequencing technology including DNA sequencing [1, 2] and RNA sequencing [3, 4] provides multiscale genomic features, such as mRNA expression [5, 6], microRNA expression [7, 8], and gene structure variation [9, 10], to characterize cancers in different aspects at the molecular level. These features have been widely used to construct models in clinical cancer research for distinguishing the subtypes of cancer [11] and stratifying the patients [12], as well as predicting the prognosis of cancers [13]. The clinical cancer samples were first divided into two groups according to the phenotypic status.

Methods

Results

Conclusion