Comparative evaluation of network features for the prediction of breast cancer metastasis

Nahim Adnan, Jianhua Ruan, Tim H.M. Huang, Zhijie Liu

doi:10.1186/s12920-020-0676-3

Abstract

Background: Discovering a highly accurate and robust gene signature for predicting breast cancer metastasis from gene expression profiles of primary tumors is one of the most challenging tasks in reducing the number of deaths in women. Because gene-based features have had limited success in achieving satisfactory prediction accuracy, many methods have been proposed in recent years to develop network-based features by integrating network information with gene expression. However, evaluation results have been too inconsistent to confirm the effectiveness of network-based features, because many confounding factors are involved in the classification model learning process, such as data normalization, dimension reduction, and feature selection. An unbiased comparative evaluation is essential for uncovering the strength of network-based features.

Methods: In this study, we compared several types of network-based features, obtained by applying different mathematical operators (Mean, Maximum, Minimum, Median, Variance) to a geneset (i.e., a gene and its neighbors in the network) in a protein-protein interaction network and a gene co-expression network, for their ability to predict breast cancer metastasis using gene expression data from more than 10 patient cohorts.

Results: While network-based features are usually statistically more significant than gene-based features, a consistent improvement in prediction performance using network-based features requires a substantial number of patients in the dataset. Contrary to many previous reports, no evidence was found to support the robustness of network-based features, and we argue that some of the robustness may be due to the inherent bias associated with node degree in the network. In addition, different types of network features seem to cover different pathways and are complementary to each other. Consequently, an ensemble classifier combining different network features was proposed and was found to significantly outperform classifiers based on gene-based features or any single type of network-based feature.

Conclusions: Network-based features and their combination show promise for improving the prediction of breast cancer metastasis but may require a large amount of training data. The robustness claim for network-based features needs to be re-examined with network node degree and other confounding factors taken into consideration.
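The geneset aggregation described in the Methods paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based data layout, the toy network, and the choice of population variance for the Variance operator are all assumptions made for illustration.

```python
from statistics import mean, median, pvariance

# Operators named in the abstract. Using population variance is an assumption.
OPERATORS = {
    "Mean": mean,
    "Maximum": max,
    "Minimum": min,
    "Median": median,
    "Variance": pvariance,
}


def geneset(gene, network):
    """Return the gene followed by its network neighbors (sorted for determinism)."""
    return [gene] + sorted(network.get(gene, set()) - {gene})


def network_features(expression, network, operator):
    """Aggregate one patient's expression values over each gene's geneset.

    expression: dict mapping gene -> expression value for a single patient
    network:    dict mapping gene -> set of neighboring genes
    operator:   one of the keys of OPERATORS
    """
    op = OPERATORS[operator]
    features = {}
    for gene in expression:
        # Only genes with a measured expression value contribute.
        values = [expression[g] for g in geneset(gene, network) if g in expression]
        features[gene] = op(values)
    return features


# Hypothetical toy data: gene A is connected to B and C.
toy_network = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
toy_expression = {"A": 1.0, "B": 3.0, "C": 5.0}
mean_features = network_features(toy_expression, toy_network, "Mean")
```

In this sketch each gene yields one feature per operator, so a high-degree gene aggregates over more values than a low-degree one, which is the kind of node-degree dependence the Results paragraph cautions about.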

Highlights

Discovering a highly accurate and robust gene signature for the prediction of breast cancer metastasis from gene expression profiling of primary tumors is one of the most challenging tasks in reducing the number of deaths in women.
Although the “Desmedt” dataset has relatively more patients and a less skewed class distribution, no features passed the False Discovery Rate (FDR) corrected p-value threshold, indicating that the differential analysis depends on the nature of the dataset.
Note that the total number of CEEdge and PPIEdge features is much larger than that of the other network-based features; these feature types also have the highest number of significant features.
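As an illustration of how a feature can look significant on its raw p-value yet fail an FDR-corrected threshold, here is a minimal sketch of the Benjamini-Hochberg adjustment. The text does not say which FDR procedure was used, so Benjamini-Hochberg is an assumption, and the p-values below are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for three features.
raw = [0.2, 0.5, 0.03]
adjusted = benjamini_hochberg(raw)
# Features passing a 0.05 threshold (assumed here) after correction.
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
```

With more features tested, each raw p-value is multiplied by a larger factor, so feature types with many candidates face a stricter correction.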

Summary

Introduction

Discovering a highly accurate and robust gene signature for the prediction of breast cancer metastasis from gene expression profiling of primary tumors is one of the most challenging tasks in reducing the number of deaths in women. Evaluation results have been too inconsistent to confirm the effectiveness of network-based features, because many confounding factors are involved in the classification model learning process, such as data normalization, dimension reduction, and feature selection. About 5% of women have metastatic breast cancer (i.e., recurrence of cancer) at their first diagnosis [3]. Histology and tumor size alone are not sufficient to determine breast cancer metastasis [4]. Owing to the availability of gene expression data for primary tumors, many methods have been developed over the last decade to predict breast cancer metastasis outcomes. A patient who remains free from recurrence for at least 5 years is termed a good outcome, and a patient who relapses within 5 years after the first diagnosis is termed a poor outcome.
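The good/poor outcome definition can be written as a small labeling rule. This is a sketch under stated assumptions: the function and parameter names are hypothetical, a relapse at exactly 5 years is counted as good, and patients followed for less than 5 years without relapse are left unlabeled, which the text does not specify.

```python
def outcome_label(relapsed, years_observed, cutoff_years=5.0):
    """Label a patient as 'good', 'poor', or None (undetermined).

    relapsed:       True if a relapse was recorded
    years_observed: time to relapse if relapsed, else length of follow-up
    cutoff_years:   the 5-year threshold from the text
    """
    if relapsed and years_observed < cutoff_years:
        return "poor"  # relapse within the cutoff
    if years_observed >= cutoff_years:
        return "good"  # recurrence-free for at least the cutoff
    return None  # assumption: too little follow-up to assign a class


# Hypothetical patients: (relapsed, years_observed)
patients = [(True, 2.5), (False, 7.0), (False, 3.0), (True, 6.0)]
labels = [outcome_label(r, y) for r, y in patients]
```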

Objectives

Methods

Results

Conclusion