Abstract

The aim of this study is to filter out the most informative genes that mainly regulate the target tissue class, increase classification accuracy, reduce the curse of dimensionality, and discard redundant and irrelevant genes. This paper presents gene selection using bagging sub-forest (BSF). The proposed method derives gene importance from the idea underlying the standard random forest algorithm. The new method is compared with three state-of-the-art methods: Wilcoxon, masked painter, and proportional overlapped score (POS). The methods were applied to five data sets: Colon, Lymph node breast cancer, Leukaemia, Serrated colorectal carcinomas, and Breast Cancer. For the comparison, the top 20 genes were selected with each gene selection method, and random forest (RF) and support vector machine (SVM) classifiers were then applied to the data sets restricted to the selected genes to assess predictive performance. Classification accuracy, Brier score, and sensitivity were used as performance measures. The proposed method gave better results than the other gene selection methods with both the RF and SVM classifiers on all the data sets. Because it showed improved performance in terms of classification accuracy, Brier score, and sensitivity, it could be used as a novel method for gene selection to classify tissue samples into their correct classes.

Highlights

  • Cancer is a genetic disease caused by changes in some of the genes that control how our body's cells function

  • Gene selection procedures help discover discriminative genes that regulate the target class while eliminating irrelevant and useless genes that play no role in regulating the response class[3]

  • The first column showed the classifiers, the second was the performance measure used against each of the classifiers and the subsequent columns were the values obtained for the measures for each gene selection method


Summary

Introduction

Cancer is a genetic disease caused by changes in some of the genes that control how our body's cells function (for example, how they grow and divide to make new cells). Statistical models lose generalization power and interpretability when applied to unseen data, because very few genes regulate the target class and the rest are redundant (genes with similar expression values) or non-informative[4,5]. These analyses require considerable computational resources[3]. Genes that have high relevance scores are selected and then used for classification with different classifiers. These methods do not require a classification algorithm to select important genes. Filter methods can handle large data sets and are computationally fast and simple.
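
To make the idea of a filter method concrete, the following is a minimal sketch of score-based gene selection. It is an illustration only and is neither the proposed BSF method nor the exact Wilcoxon procedure compared in the paper. It assumes a binary class label and uses a rank-based relevance score (the Mann-Whitney U statistic rescaled to |AUC - 0.5|); the function names `rank_sum_score` and `select_top_genes` and the toy data are hypothetical.

```python
from typing import List, Sequence


def rank_sum_score(values: Sequence[float], labels: Sequence[int]) -> float:
    """Relevance of one gene as |AUC - 0.5|, derived from the Mann-Whitney U statistic.

    0.0 means the two classes overlap completely; 0.5 means perfect separation.
    Assumes labels are coded 0 and 1 (an assumption of this sketch).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # U counts (class-1, class-0) pairs where the class-1 value is larger; ties count half.
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    auc = u / (len(pos) * len(neg))
    return abs(auc - 0.5)


def select_top_genes(
    expression: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> List[int]:
    """Return the indices of the k highest-scoring genes (rows = genes, columns = samples)."""
    scores = [rank_sum_score(row, labels) for row in expression]
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order[:k]


# Toy data (hypothetical): 4 genes measured on 6 samples, 3 per class.
labels = [0, 0, 0, 1, 1, 1]
expression = [
    [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],  # gene 0: higher in class 1
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # gene 1: no class difference
    [5.0, 4.8, 5.2, 1.0, 1.1, 0.9],  # gene 2: lower in class 1
    [1.0, 3.0, 2.0, 2.5, 1.5, 3.5],  # gene 3: weak difference
]
top = select_top_genes(expression, labels, k=2)
print(top)
```

No classifier is involved in the ranking step, which is what keeps filter methods fast. The selected gene indices would then be passed to a classifier such as RF or SVM in a separate step.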

