Abstract

Gene expression profiles can be used in the diagnosis of critical diseases such as cancer. Selecting biomarker genes from these profiles is crucial for cancer detection. This paper presents a framework built on a two-stage multifilter hybrid model of feature selection for colon cancer classification. Colon cancer is now extremely common compared with other types of cancer. There is a need for a fast and accurate method to detect cancerous tissues and to enhance the diagnostic process and drug discovery. The objective of the study is to improve the diagnosis of colon cancer through this two-stage, multifilter model of feature selection. The model first selects features using a combination of Information Gain and a Genetic Algorithm. The next stage filters and ranks the genes identified in this way using the minimum Redundancy Maximum Relevance (mRMR) technique. The final phase further analyzes the data using correlated machine learning algorithms. This two-stage approach, in which genes are selected before classification techniques are applied, improves success rates for the identification of cancer cells. Decision Tree, K-Nearest Neighbor, and Naïve Bayes classifiers showed promisingly accurate results using the developed hybrid framework. The proposed method achieved higher accuracy than the existing methods reported in the literature. This study can serve as a starting point for enhancing treatment and drug discovery for colon cancer.
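The filter-then-rank idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes expression values have already been discretized, it omits the Genetic Algorithm search and the classification step entirely, and the function names (`information_gain_filter`, `mrmr_rank`) and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y); equals the information gain of x about y."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def information_gain_filter(features: List[List[int]], labels: List[int],
                            keep: int) -> List[int]:
    """Filter step: keep the indices of the `keep` genes with highest information gain."""
    scores = [mutual_information(f, labels) for f in features]
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]


def mrmr_rank(features: List[List[int]], labels: List[int],
              candidates: List[int], k: int) -> List[int]:
    """Ranking step: greedy mRMR (relevance minus mean redundancy) over the candidates."""
    selected: List[int] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = mutual_information(features[i], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[i], features[j])
                             for j in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: one list per gene, one entry per sample (already binarized).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
genes = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # gene 0: informative
    [0, 0, 0, 1, 1, 1, 1, 1],  # gene 1: exact copy of gene 0 (redundant)
    [0, 0, 1, 0, 0, 1, 1, 1],  # gene 2: weaker but complementary
    [0, 1, 0, 1, 0, 1, 0, 1],  # gene 3: unrelated to the labels
]
shortlist = information_gain_filter(genes, labels, keep=3)
print(mrmr_rank(genes, labels, shortlist, k=2))
```

On this toy data the filter drops the uninformative gene 3, and the mRMR step then prefers gene 2 over gene 1, because gene 1 duplicates information already carried by gene 0.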

Highlights

  • Cancer is reckoned by the World Health Organisation (WHO) to be the second most common cause of death in the world [1]

  • The least accurate algorithm was Support Vector Machines (SVM) (81.25%), whilst the performance of Naïve Bayes (NB) (87.5%) was acceptable; 2) for dataset 2, NB performed best, with a classification accuracy of 100% under the two-stage model

  • A confusion matrix records True Positives (TP), the number of correctly identified positive samples; True Negatives (TN), the number of correctly identified negative samples; False Positives (FP), the negative samples erroneously diagnosed as positive; and False Negatives (FN), the positive samples wrongly diagnosed as negative
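As an illustration of the confusion matrix entries defined above, the following sketch counts TP, TN, FP, and FN for a binary classifier and derives accuracy from them. The `ConfusionMatrix` type, the `confusion_matrix` helper, and the label vectors are hypothetical and are not taken from the study's data.

```python
from typing import NamedTuple, Sequence


class ConfusionMatrix(NamedTuple):
    tp: int  # positive samples correctly identified
    tn: int  # negative samples correctly identified
    fp: int  # negative samples wrongly diagnosed as positive
    fn: int  # positive samples wrongly diagnosed as negative

    @property
    def accuracy(self) -> float:
        """Fraction of all samples that were classified correctly."""
        total = self.tp + self.tn + self.fp + self.fn
        return (self.tp + self.tn) / total if total else 0.0


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> ConfusionMatrix:
    """Count the four outcomes for a binary classification task."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionMatrix(tp, tn, fp, fn)


# Hypothetical labels: 1 = tumour tissue, 0 = normal tissue.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm, cm.accuracy)
```

Accuracy here is (TP + TN) divided by the total number of samples, which is the kind of percentage figure quoted for the classifiers in the highlights.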


Summary

Introduction

Cancer is reckoned by the World Health Organisation (WHO) to be the second most common cause of death in the world [1]. The first dataset was collected from Alon et al. [71] and has been used in several colon cancer research studies [18, 46,47,48,49,50,51,52,53,54,55,56,57, 59, 72]. This dataset is publicly available and is still used in the most recent studies [22,23,24, 35, 57, 56, 73,74,75,76,77].
