Cancer Classification using Ensemble Feature Selection and Random Forest Classifier

Nimrita Koul, Sunilkumar S Manvi

doi:10.1088/1757-899x/1074/1/012004

Open Access

Abstract

High volumes of genomic data made available by high-throughput gene expression technologies, such as next-generation sequencing and microarrays, have made it possible to develop models that computationally analyse this data and infer meaningful insights, such as the presence of a disease, the nature of the disease, and the site of tumour localization in cancers. Because gene expression data is very high dimensional (each gene is one dimension) and has very few observations, it is imperative to apply feature selection before using the data for classification. In this paper, we propose a method for classifying human cancer types by analysing microarray gene expression data. We use an ensemble feature selection algorithm to select subsets of 5, 10, 20 and 30 genes and apply random forest classifiers to these subsets, obtaining classification accuracy and other performance measures for comparison with existing solutions. With our algorithm, we obtain 100% classification accuracy on the colon cancer data set using just 5 genes.
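The abstract does not state which selectors make up the ensemble or how their outputs are combined. As a minimal sketch of the general idea only, the example below assumes two simple univariate filters (absolute difference of class means and a Fisher-style score) whose per-gene ranks are averaged, with the k best-ranked genes kept. The function names, the choice of filters, the mean-rank aggregation, and the toy expression matrix are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pvariance

def column(X, j, y, label):
    """Values of gene j for the samples whose class equals label."""
    return [row[j] for row, lab in zip(X, y) if lab == label]

def mean_difference(X, y):
    """Filter 1 (assumed): absolute difference of the two class means per gene."""
    return [abs(mean(column(X, j, y, 0)) - mean(column(X, j, y, 1)))
            for j in range(len(X[0]))]

def fisher_score(X, y, eps=1e-12):
    """Filter 2 (assumed): squared mean difference over summed class variances."""
    scores = []
    for j in range(len(X[0])):
        a, b = column(X, j, y, 0), column(X, j, y, 1)
        scores.append((mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b) + eps))
    return scores

def to_ranks(scores):
    """Convert scores to ranks, where rank 0 is the highest-scoring gene."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks

def ensemble_select(X, y, k, scorers=(mean_difference, fisher_score)):
    """Average each gene's rank across all filters and keep the k best genes."""
    rank_lists = [to_ranks(scorer(X, y)) for scorer in scorers]
    mean_rank = [mean([ranks[j] for ranks in rank_lists]) for j in range(len(X[0]))]
    return sorted(range(len(mean_rank)), key=lambda j: mean_rank[j])[:k]

# Toy data (hypothetical): 6 samples x 5 genes, two classes.
# Genes 0 and 3 differ between the classes; genes 1, 2 and 4 do not.
X = [
    [5.0, 1.0, 2.0, 0.1, 3.0],
    [5.2, 1.1, 2.1, 0.2, 2.9],
    [4.9, 0.9, 1.9, 0.1, 3.1],
    [1.0, 1.0, 2.0, 0.9, 3.0],
    [1.1, 1.2, 2.2, 1.0, 3.2],
    [0.9, 0.8, 1.8, 1.1, 2.8],
]
y = [0, 0, 0, 1, 1, 1]

selected = ensemble_select(X, y, k=2)
print(selected)  # indices of the two genes with the best mean rank
```

The selected gene columns would then be passed to a random forest classifier, for example scikit-learn's `RandomForestClassifier`. When accuracy is estimated by cross-validation, the selection step should be run on the training portion of each fold only, so that the test samples do not influence which genes are chosen.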

Full Text