Abstract

Classifying cancer from gene expression data is a promising research area in bioinformatics and data mining. Not all genes contribute to efficient sample classification, and redundancy in gene expression data lowers classification performance, so a robust feature selection method is needed to identify the genes that best distinguish samples. This paper presents combinations of gene selection and classification methods built from ranking and wrapper methods. In the ranking stage, information gain was used to reduce the dimensionality to the top 1% and 5% of genes. In the wrapper stage, K-nearest neighbors and Naïve Bayes were used with the Best First, Greedy Stepwise, and Rank Search strategies. Several combinations were investigated because no single model is known to give the best results on every dataset under all circumstances. Combining multiple feature selection methods and applying different classification models could therefore provide a better decision on the final predicted cancer types. Compared with existing classifiers, the proposed combined gene selection methods obtained comparable performance.
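The ranking stage described in the abstract can be illustrated with a minimal sketch. It assumes a median split to discretize each gene's expression values before computing information gain; the paper does not state its discretization, and the function names (`entropy`, `information_gain`, `top_fraction`) and the toy data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Information gain of one gene after a median split (assumed discretization)."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    low = [y for v, y in zip(values, labels) if v < median]
    high = [y for v, y in zip(values, labels) if v >= median]
    n = len(labels)
    # Weighted entropy of the class labels within each non-empty half.
    conditional = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - conditional

def top_fraction(X, y, fraction):
    """Return the indices of the top `fraction` of genes ranked by information gain."""
    n_genes = len(X[0])
    scores = [information_gain([row[j] for row in X], y) for j in range(n_genes)]
    keep = max(1, round(n_genes * fraction))
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]

# Toy data (hypothetical): 6 samples x 4 genes; gene 2 separates the classes.
X = [
    [0.1, 5.0, 1.0, 3.3],
    [0.4, 4.8, 1.2, 3.1],
    [0.2, 5.1, 0.9, 3.4],
    [0.3, 4.9, 7.9, 3.2],
    [0.1, 5.2, 8.1, 3.0],
    [0.5, 5.0, 8.3, 3.5],
]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

selected = top_fraction(X, y, 0.25)  # keep the top 25% of the 4 toy genes
print(selected)
```

On a real microarray dataset the same call with `fraction=0.01` or `fraction=0.05` would correspond to the 1% and 5% cut-offs mentioned above.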

Highlights

  • Gene expression is the process by which a deoxyribonucleic acid (DNA) sequence is transcribed into ribonucleic acid (RNA)

  • After applying different combinations of ranking and wrapper methods, we found that Information Gain & Wrapper (NB & Best First) and Information Gain & Wrapper (NB & Greedy Stepwise) obtained the best performance compared with all other methods and combinations, both before and after applying feature selection

  • When the top 5% of genes were selected in the ranking method, the results in Table III showed that random forest performed best on the high-dimensional dataset, but once wrapper methods were applied, the combination of Information Gain and Wrapper (NB & Best First) obtained the best results
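The wrapper idea in the highlights above can be sketched as a greedy forward search that scores each candidate gene subset with a classifier. This is only an illustration under stated assumptions: it wraps a 1-nearest-neighbor classifier scored by leave-one-out accuracy, rather than the Naïve Bayes and search implementations used in the paper, and the function names and toy data are hypothetical.

```python
import math

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on the chosen genes."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = math.inf, None
        for k in range(len(X)):
            if k == i:
                continue  # the held-out sample cannot be its own neighbor
            dist = sum((X[i][g] - X[k][g]) ** 2 for g in genes)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)

def greedy_forward_selection(X, y, candidates):
    """Add one gene at a time while the wrapped classifier's accuracy improves."""
    selected, best_score = [], 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [g]), g) for g in remaining]
        score, gene = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(gene)
        remaining.remove(gene)
        best_score = score
    return selected, best_score

# Toy data (hypothetical): 6 samples x 4 genes; gene 2 separates the classes.
X = [
    [0.1, 5.0, 1.0, 3.3],
    [0.4, 4.8, 1.2, 3.1],
    [0.2, 5.1, 0.9, 3.4],
    [0.3, 4.9, 7.9, 3.2],
    [0.1, 5.2, 8.1, 3.0],
    [0.5, 5.0, 8.3, 3.5],
]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

# In a two-stage pipeline, `candidates` would be the genes kept by the ranking step.
subset, score = greedy_forward_selection(X, y, candidates=range(4))
print(subset, score)
```

Because every candidate subset is scored by retraining the classifier, the search becomes expensive as the number of genes grows, which is one reason to shrink the candidate pool with a ranking filter first.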


Summary

Introduction

Gene expression is the process by which a deoxyribonucleic acid (DNA) sequence is transcribed into ribonucleic acid (RNA). A microarray is a technique for simultaneously measuring the expression levels of tens of thousands of genes on a single chip. Microarrays thus provide an effective way to collect data that can be used to establish the expression patterns of thousands of genes. The high dimensionality of gene expression data is a major challenge: using all genes in microarray gene expression classification can lead to incorrect diagnosis. The two main reasons for low classification accuracy are the large number of features (genes) relative to the limited sample size, and redundancy in the expressed data [2]. Standard machine learning methods have not been effective, since they are better suited to settings with more samples than features.
