Abstract

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. These data have opened up new insights into biological systems but also pose analytical challenges. One important problem is selecting an informative feature subset and predicting future outcomes. It is crucial that models are not overfitted and give accurate results on new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction that uses nested and repeated cross-validation. We evaluated our approach on simulated data and on two publicly available gene expression datasets. The proposed method showed better predictive accuracy for new cases than standard cross-validation.

Highlights

  • Research into the genetic basis of complex diseases such as cancer has become increasingly popular in recent years owing to the advent of high-throughput technologies such as microarrays and sequencing

  • Although the idea of nested/repeated cross-validation has been mentioned elsewhere (e.g., Stone, 1974, first outlined the idea of double cross-validation), to the best of our knowledge no existing literature has proposed or assessed a systematic framework for using nested/repeated cross-validation at the computational level. This manuscript is organized as follows: in Section 2, we briefly introduce relevant statistical concepts and models; in Section 3, we propose the framework of nested/repeated cross-validation for model selection and feature selection; in Section 4, we present a simulation study that compares single cross-validation with nested/repeated cross-validation for building the predictive model; in Section 5, two publicly available gene expression datasets, the leukemia data of Golub et al. (1999) and the cervical cancer data of The Cancer Genome Atlas (TCGA Network 2017), are used to demonstrate the applicability of the repeated/nested cross-validation method to real high-dimensional data

  • Three embedded methods are implemented for building the predictive model and selecting features: regularized regression via the elastic net, support vector machines, and random forests

Summary

Introduction

Research into the genetic basis of complex diseases such as cancer has become increasingly popular in recent years owing to the advent of high-throughput technologies such as microarrays and sequencing. Such technologies query the expression of thousands of genes simultaneously (Trevino, Falciani & Barrera-Saldana 2007). The resulting dataset is usually high-dimensional, with many variables or features but a relatively small sample size n. A predictive model can be defined as a statistical model $\hat{f}$, an estimate of the true function $f$, where $f$ maps the gene expression data to the class of the subjects: $f: X \to Y$ (1). Three embedded methods are implemented for building the predictive model and selecting features: regularized regression via the elastic net, support vector machines, and random forests.
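The nested and repeated cross-validation idea can be illustrated with a minimal sketch. It is not the authors' implementation: to stay dependency-free it swaps the paper's embedded methods (elastic net, support vector machine, random forest) for a simple mean-difference feature ranking followed by a nearest-centroid classifier, and all function names, the tuning grid, and the simulated data are assumptions made for illustration. The inner loop picks the number of features on the training portion only, and the outer loop scores that choice on samples never used for ranking or tuning.

```python
import random
import statistics

def stratified_folds(rows, y, k, rng):
    """Deal the indices in `rows` into k folds, keeping the two classes balanced."""
    folds, pos = [[] for _ in range(k)], 0
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        rng.shuffle(members)
        for m in members:
            folds[pos % k].append(m)
            pos += 1
    return folds

def rank_features(X, y, rows):
    """Rank features by absolute difference of class means, using `rows` only."""
    scores = []
    for j in range(len(X[0])):
        m1 = statistics.fmean(X[i][j] for i in rows if y[i] == 1)
        m0 = statistics.fmean(X[i][j] for i in rows if y[i] == 0)
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)]

def fit_predict(X, y, train, test, k_feat):
    """Select k_feat features on `train`, then classify `test` by nearest centroid."""
    feats = rank_features(X, y, train)[:k_feat]
    centroids = {
        label: [statistics.fmean(X[i][j] for i in train if y[i] == label) for j in feats]
        for label in (0, 1)
    }
    preds = []
    for i in test:
        dist = {
            label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, centroids[label]))
            for label in (0, 1)
        }
        preds.append(min(dist, key=dist.get))
    return preds

def accuracy(y, rows, preds):
    return sum(y[i] == p for i, p in zip(rows, preds)) / len(rows)

def nested_repeated_cv(X, y, grid=(1, 5, 20), outer_k=5, inner_k=4, repeats=3, seed=0):
    """Mean outer-fold accuracy over `repeats` random re-partitions of the data."""
    rng = random.Random(seed)
    all_rows = list(range(len(y)))
    outer_scores = []
    for _ in range(repeats):                      # repeated: new random split each time
        for test in stratified_folds(all_rows, y, outer_k, rng):
            held_out = set(test)
            train = [i for i in all_rows if i not in held_out]
            inner = stratified_folds(train, y, inner_k, rng)

            def inner_score(k_feat):              # inner loop: tune on training data only
                scores = []
                for val in inner:
                    val_set = set(val)
                    fit = [i for i in train if i not in val_set]
                    scores.append(accuracy(y, val, fit_predict(X, y, fit, val, k_feat)))
                return statistics.fmean(scores)

            best_k = max(grid, key=inner_score)
            # outer loop: refit on the full training part, score on untouched samples
            outer_scores.append(accuracy(y, test, fit_predict(X, y, train, test, best_k)))
    return statistics.fmean(outer_scores)

def simulate(n=40, p=50, informative=5, shift=2.0, seed=1):
    """Hypothetical toy data: the first `informative` features differ between classes."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * y[i] if j < informative else 0.0, 1.0) for j in range(p)]
         for i in range(n)]
    return X, y

X, y = simulate()
print("nested/repeated CV accuracy:", nested_repeated_cv(X, y))
```

Because feature ranking happens inside `fit_predict`, it is redone on every training split, so the held-out samples never influence which features are selected. Ranking features once on the full dataset before cross-validating would leak information and inflate the estimate.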
