Abstract

Because genes vastly outnumber samples in microarray data, sample classification on such data is considered an extremely difficult task. Feature selection mitigates this problem by removing irrelevant and redundant genes from the data. In this paper, we propose a new methodology for feature selection that aims to detect relevant, non-redundant and interacting genes by analysing the feature value space instead of the feature space. Following this methodology, we also propose a new feature selection algorithm, namely Pavicd (Probabilistic Attribute-Value for Class Distinction). Experiments on fourteen microarray cancer datasets reveal that Pavicd obtains the best performance in terms of running time and classification accuracy when Ripper-k and C4.5 are used as classifiers. When SVM (Support Vector Machine) is used, the Gbc (Genetic Bee Colony) wrapper algorithm gets the best results; however, Pavicd is significantly faster.

Highlights

  • Microarray is a multiplex technology used in molecular biology and medicine that enables biologists to monitor expression levels of thousands of genes [1]

  • The main motivation of this paper is to present an efficient gene selection algorithm that can detect complex relations among relevant genes and thereby significantly improve sample classification

  • For the sequential forward search, we start with $F_k$ equal to the feature value in $G_k$ that maximizes Equation (7), and, in each iteration, we explore $G_k$ so that the feature value $f_i^j \in G_k$ that maximizes $\mu(F_k, f_i^j, c_k)$ is selected, and any feature value $f_i^j$ such that $\mu(F_k, f_i^j, c_k) < \mu(F_k, c_k)$ holds is removed from $G_k$ and never tested again
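
The sequential forward search described in the last highlight can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. Neither Equation (7) nor the measure μ is defined in this excerpt, so both are passed in as caller-supplied functions: `seed_score` stands in for Equation (7), and `mu` stands in for μ, where calling `mu` on the selected set plus one candidate plays the role of μ(F_k, f, c_k). The names `forward_search`, `COVER`, `toy_seed_score` and `toy_mu` are hypothetical, and the toy scoring functions exist only to make the sketch runnable.

```python
from typing import Callable, Hashable, List, Sequence


def forward_search(
    candidates: Sequence[Hashable],
    seed_score: Callable[[Hashable], float],
    mu: Callable[[List[Hashable]], float],
) -> List[Hashable]:
    """Greedy forward search with permanent pruning of non-improving candidates.

    seed_score: placeholder for Equation (7) (assumption, not defined here).
    mu:         placeholder for the measure mu over a set of feature values.
    """
    remaining = list(candidates)
    if not remaining:
        return []

    # Start with the single feature value that maximizes the seed score.
    seed = max(remaining, key=seed_score)
    selected = [seed]
    remaining.remove(seed)

    while remaining:
        base = mu(selected)  # mu(F_k, c_k)
        # Score every remaining candidate when added to the current set.
        scored = [(mu(selected + [f]), f) for f in remaining]
        # Candidates scoring below the current set are dropped for good.
        scored = [(s, f) for s, f in scored if s >= base]
        if not scored:
            break
        # Select the best surviving candidate (first one wins on ties).
        _, best = max(scored, key=lambda pair: pair[0])
        selected.append(best)
        remaining = [f for _, f in scored if f != best]

    return selected


# Toy stand-ins (assumptions): each feature value "covers" some sample ids,
# and the score rewards coverage while penalizing the size of the set.
COVER = {
    "g1=high": {1, 2, 3},
    "g2=low": {3, 4},
    "g3=high": {1, 2},
    "g4=low": {5},
}


def toy_seed_score(f: str) -> float:
    return float(len(COVER[f]))


def toy_mu(values: List[str]) -> float:
    covered = set().union(*(COVER[f] for f in values))
    return len(covered) - 0.5 * len(values)


result = forward_search(list(COVER), toy_seed_score, toy_mu)
print(result)
```

In this toy run, `g3=high` adds nothing beyond what `g1=high` already covers, so it is pruned in the first iteration and never re-evaluated, which is the behaviour the pruning rule is meant to capture.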


Summary

Introduction

Microarray is a multiplex technology used in molecular biology and medicine that enables biologists to monitor expression levels of thousands of genes [1]. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer [2] and to discover new drug designs in the pharmaceutical industry [3]. According to the World Health Organization, cancer is among the leading causes of death worldwide, accounting for more than 8 million deaths. Finding a mechanism to discover the genetic expressions that may lead to an abnormal growth of cells is a first-order task today. Short sequences of genes tagged with fluorescent materials are printed on a glass surface for hybridization [4]. The resulting dataset is a two-dimensional array D with thousands of columns (genes) and several rows (instances).
