Abstract

Correct classification of cancer subtypes is important for the in-depth study of cancer pathogenesis and for delivering accurate treatment to cancer patients. In recent years, classifying cancer subtypes with deep neural networks and gene expression data has become an active research topic. However, most classifiers can suffer from overfitting and low classification accuracy when dealing with small-sample, high-dimensional biological data. In this paper, the Cascade Flexible Neural Forest (CFNForest) model is proposed for cancer subtype classification. CFNForest extends the traditional flexible neural tree (FNT) structure to an FNT Group Forest using a bagging ensemble strategy, and it can automatically generate the model's structure and parameters. To deepen the FNT Group Forest without introducing new hyperparameters, a multilayer cascade framework is used, which transforms features between levels and improves the performance of the model. CFNForest also improves operational efficiency and robustness through a sample selection mechanism between layers and by assigning different weights to the output of each layer. FNT Group Forests built on different feature sets are used to enrich the structural diversity of the model, which makes it more suitable for small-sample datasets. Experiments on RNA-seq gene expression data show that CFNForest effectively improves the accuracy of cancer subtype classification and that the classification results are robust.
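
The abstract names three cascade mechanisms: feature transformation between levels, sample selection between layers, and per-layer output weights. It does not give their exact rules, so the following Python is only a rough sketch under stated assumptions. The concatenation-style feature transformation, the confidence threshold, the weight values, and all function names are hypothetical illustrations and are not taken from the paper.

```python
from typing import List, Sequence


def augment_features(features: Sequence[float],
                     class_probs: Sequence[float]) -> List[float]:
    """Assumed feature transformation: append one level's class-probability
    vector to the original features before passing them to the next level."""
    return list(features) + list(class_probs)


def is_confident(class_probs: Sequence[float], threshold: float) -> bool:
    """Assumed sample-selection rule: a sample whose top class probability
    reaches the threshold is treated as settled and skips deeper levels."""
    return max(class_probs) >= threshold


def combine_layer_outputs(layer_probs: Sequence[Sequence[float]],
                          layer_weights: Sequence[float]) -> List[float]:
    """Weighted average of the class-probability vectors from each level."""
    if len(layer_probs) != len(layer_weights):
        raise ValueError("one weight per layer is required")
    total_weight = sum(layer_weights)
    n_classes = len(layer_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(layer_weights, layer_probs))
        / total_weight
        for c in range(n_classes)
    ]


# Hypothetical outputs of three cascade levels for one sample, four subtypes.
layer_probs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.55, 0.25, 0.15, 0.05],
    [0.70, 0.15, 0.10, 0.05],
]
layer_weights = [1.0, 2.0, 3.0]  # hypothetical: deeper levels count more

final_probs = combine_layer_outputs(layer_probs, layer_weights)
predicted_class = final_probs.index(max(final_probs))
```

In this sketch the weights are normalized inside `combine_layer_outputs`, so the combined vector remains a probability distribution whenever each level's output is one.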

Highlights

  • Cancer is a heterogeneous lesion caused by the loss of the normal regulation of local tissue cell growth at the gene level under the action of carcinogenic factors [1]

  • Datasets. The RNA sequence gene expression data used in this paper were downloaded from The Cancer Genome Atlas (TCGA) [15]. Three types of cancers were downloaded from the TCGA database and sorted: Breast Invasive Carcinoma (BRCA), Glioblastoma Multiforme (GBM), and Lung Cancer (LUNG). The labeling of each sample is based on real clinical data of cancer patients provided by TCGA. There are four basic subtypes of BRCA: Basal-like (98/∼19.06%), HER2-enriched (58/∼11.28%), Luminal-A (231/∼44.94%), and Luminal-B (127/∼24.72%)

  • To demonstrate the superiority of CFNForest in classification performance, we compared it with k-nearest neighbor (KNN), the probabilistic graphical model (PGM) [11], support vector machine (SVM), random forest (RF), discriminative deep belief network (DDBN) [13], and boosting cascade deep forest (BCDForest) [12]. The gene information obtained after feature processing is used as the input to each classifier
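
The BRCA subtype counts given in the Datasets highlight can be turned into class shares with a few lines of Python. This is a minimal sketch for checking the arithmetic. The variable names are illustrative and not from the paper, and the exact shares can differ from the approximate percentages quoted above in the last decimal place.

```python
# BRCA subtype sample counts as listed in the Datasets highlight.
brca_counts = {
    "Basal-like": 98,
    "HER2-enriched": 58,
    "Luminal-A": 231,
    "Luminal-B": 127,
}

total = sum(brca_counts.values())

# Share of each subtype as a percentage of all BRCA samples.
brca_shares = {name: 100.0 * n / total for name, n in brca_counts.items()}

for name, share in brca_shares.items():
    print(f"{name}: {brca_counts[name]} samples ({share:.2f}%)")
print(f"total: {total} samples")
```

The shares make the class imbalance explicit: Luminal-A accounts for roughly 45% of the BRCA samples, while HER2-enriched accounts for roughly 11%.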


Summary

Introduction

Cancer is a heterogeneous lesion caused by the loss of the normal regulation of local tissue cell growth at the gene level under the action of carcinogenic factors [1]. Cancer has become one of the major causes of human death [2]. Traditional cancer research methods were mostly based on clinical experience. The molecular expression level of cancer is highly heterogeneous, which means that there are many molecular subtypes in cancer tissue. Cancer patients with the same symptoms can show significant prognostic differences under the same treatment regimens [3]. Heterogeneity is one of the fundamental features of cancer, and it is the biggest challenge for the development of precision therapy for cancer [4].

