Abstract

Automated diagnosis systems aim to reduce the cost of diagnosis while maintaining the same efficiency. Many methods have been used for breast cancer subtype classification. Some use a single data source, while others integrate several data sources, which reduces computational performance in exchange for accuracy. Breast cancer data, especially biological data, is known to be imbalanced, and large collections of histopathological images are scarce. Recent studies have shown that the cascade Deep Forest ensemble model achieves classification accuracy competitive with alternatives such as general ensemble learning methods and conventional deep neural networks (DNNs), especially for imbalanced training sets, by learning hyper-representations through cascaded ensembles of decision trees. In this work, a cascade Deep Forest is employed to classify breast cancer subtypes, IntClust and PAM50, using multi-omics datasets and different configurations. The results show an accuracy of 83.45% for 5 subtypes and 77.55% for 10 subtypes. The significance of this work is that it shows that using gene expression data alone with the cascade Deep Forest classifier achieves accuracy comparable to other techniques with higher computational performance: the recorded time is about 5 s for 10 subtypes and 7 s for 5 subtypes.
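The cascade idea described above can be illustrated with a minimal sketch. It is not the configuration used in this work: it assumes scikit-learn, a synthetic three-class dataset in place of the gene expression data, a fixed number of layers rather than an adaptively grown cascade, and extremely randomized trees standing in for completely random forests. The helper names `fit_cascade` and `predict_proba_cascade` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split


def fit_cascade(X, y, n_layers=2, n_trees=30, seed=0):
    """Fit a fixed-depth cascade; returns a list of layers of fitted forests."""
    layers, augmented = [], X
    for depth in range(n_layers):
        forests = [
            RandomForestClassifier(n_estimators=n_trees, random_state=seed + 2 * depth),
            ExtraTreesClassifier(n_estimators=n_trees, random_state=seed + 2 * depth + 1),
        ]
        # Out-of-fold class probabilities, so the next layer never sees
        # predictions made on samples a forest was trained on.
        probas = [
            cross_val_predict(f, augmented, y, cv=3, method="predict_proba")
            for f in forests
        ]
        for f in forests:
            f.fit(augmented, y)  # refit on all training data for inference
        layers.append(forests)
        # Next layer input: original features plus every forest's class vector.
        augmented = np.hstack([X] + probas)
    return layers


def predict_proba_cascade(layers, X):
    """Pass X through the cascade and average the last layer's class vectors."""
    augmented = X
    for forests in layers:
        probas = [f.predict_proba(augmented) for f in forests]
        augmented = np.hstack([X] + probas)
    return np.mean(probas, axis=0)


# Synthetic stand-in data (assumption): 300 samples, 20 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

layers = fit_cascade(X_train, y_train)
proba = predict_proba_cascade(layers, X_test)
accuracy = float((proba.argmax(axis=1) == y_test).mean())
print(f"test accuracy on synthetic data: {accuracy:.3f}")
```

Each layer passes the original features together with the class-probability vectors of its forests to the next layer, which is the hyper-representation learning the abstract refers to. A full cascade Deep Forest would typically also stop adding layers once validation accuracy stops improving.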

Highlights

  • Breast cancer is one of the main causes of cancer death worldwide

  • The proposed system uses integrative clinical and genomics data, generated by extracting and combining the gene expression, Copy Number Aberrations (CNA), and Copy Number Variations (CNV) feature sets from the genomics dataset

  • After the first phase, in which four breast cancer subtype sub-datasets are acquired, the proposed system moves to the second phase of data preparation and preprocessing with only three of them, namely the clinical data and the Copy Number Aberrations (CNA) and Copy Number Variations (CNV) features; the fourth sub-dataset, gene expression, is passed without any preprocessing to the third phase, in which the integrated data profiles are generated
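The phases listed above can be sketched as follows. This is a rough illustration under stated assumptions, not the system's actual pipeline: it assumes pandas, per-sample tables indexed by a shared sample identifier, and placeholder preprocessing steps (one-hot encoding of clinical categories, zero-filling of missing copy-number calls), since the highlights do not specify the preprocessing. The function `build_profile`, the column names, and the values are all hypothetical.

```python
import pandas as pd


def build_profile(clinical, cna, cnv, expression):
    """Combine four per-sample tables into one integrated profile."""
    # Phase 2 (assumed preprocessing): only clinical, CNA and CNV are touched.
    clinical_p = pd.get_dummies(clinical)  # one-hot encode categorical fields
    cna_p = cna.fillna(0)                  # treat missing calls as neutral
    cnv_p = cnv.fillna(0)
    # Gene expression is passed through unchanged, as the highlights describe.
    parts = {"clin": clinical_p, "cna": cna_p, "cnv": cnv_p, "expr": expression}
    # Phase 3: prefix columns by source and keep samples present in every table.
    prefixed = [df.add_prefix(f"{name}_") for name, df in parts.items()]
    return pd.concat(prefixed, axis=1, join="inner")


# Toy tables (hypothetical values), indexed by sample identifier.
samples = ["S1", "S2", "S3"]
clinical = pd.DataFrame(
    {"age": [50, 62, 47], "er_status": ["pos", "neg", "pos"]}, index=samples
)
cna = pd.DataFrame({"TP53": [0, -1, None], "ERBB2": [2, 0, 1]}, index=samples)
cnv = pd.DataFrame({"region_1": [1, 0, 0]}, index=samples)
expression = pd.DataFrame(
    {"ESR1": [2.3, -0.7], "GATA3": [1.1, 0.2]}, index=["S1", "S2"]
)

profile = build_profile(clinical, cna, cnv, expression)
print(profile.shape)
```

The inner join keeps only samples that appear in all four tables, so a sample without gene expression measurements (here `S3`) is dropped from the integrated profile.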


Summary

Introduction

Breast cancer is one of the main causes of cancer death worldwide. Computer-aided diagnosis systems aim to reduce the cost of diagnosis while maintaining the same efficiency of the process. Reference [8] presents GcForest-PPI, a model that uses Deep Forest to predict protein–protein interaction networks. Their model showed enhanced prediction accuracy and suggested an improvement in drug discovery. Reference [10] uses deep learning with Random Forests on the METABRIC dataset to make use of the different types of data. Their results improved sensitivity by 5.1%. Several studies have used deep learning and Deep Forest with histopathological image data and mammography images [2,3,11,12,13].
