Abstract

In recent years, tumor classification based on gene expression profiles has drawn great attention, and related research results have been widely applied to the clinical diagnosis of major genetic diseases. These studies are of great importance for accurate cancer diagnosis and subtype recognition. However, microarray gene expression data are characterized by small sample sizes, high dimensionality, substantial noise, and data redundancy. To further improve the classification performance on microarray data, a gene selection approach based on the Fisher linear discriminant (FLD) and the neighborhood rough set (NRS) is proposed. First, the FLD method is employed to preliminarily reduce the gene data and obtain features with strong classification ability, which form a candidate gene subset. Then, neighborhood precision and neighborhood roughness are defined in a neighborhood decision system, and methods for calculating neighborhood dependency and the significance of an attribute are given. A reduction model for neighborhood decision systems is presented, and from it a gene selection algorithm based on FLD and NRS is derived. Finally, four public gene datasets are used in simulation experiments. Experimental results under the SVM classifier demonstrate that the proposed algorithm is effective: it selects a smaller gene subset with stronger classification ability and obtains better classification performance.

Highlights

  • With the development of gene expression profiles, the analysis and modelling of gene expression profiles have become an important topic in the field of bioinformatics research [1,2,3]

  • To verify the effectiveness of the proposed Fisher linear discriminant (FLD)-neighborhood rough set (NRS) algorithm, simulation experiments are performed on four public gene expression profile data sets, which include colon, leukaemia, lung, and prostate cancer data downloaded from http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/datasets

  • A gene selection method based on the FLD and NRS is proposed to address the poor stability, large feature subset size, and time-consuming calculations of various gene selection algorithms

Summary

Introduction

With the development of gene expression profiles, the analysis and modelling of gene expression profiles have become an important topic in the field of bioinformatics research [1,2,3]. The high dimensionality of tumor gene expression data, which often runs to thousands or even tens of thousands of genes, increases the learning cost and degrades learning performance. This is widely known as the "curse of dimensionality", which costs time and reduces the effectiveness of classification when a classifier is used to predict new samples [4,5]. To further improve the classification performance on microarray data, effectively remove redundant genes, and reduce the computational time complexity of the gene selection algorithm, the FLD method is employed to carry out a preliminary dimensionality reduction of the microarray gene data. FLD can be used to select the features that carry classification information, eliminate redundant attributes, and thereby reduce the dimensionality of the gene data. If the k-th column $w_k$ of $W$ is considered, the objective function can be transformed into $\max_{w_k} w_k^{T} S_B w_k$.

A Lagrangian equation is established as
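
For orientation, the sketch below shows one common way this kind of objective is solved in practice. It is not taken from the paper and rests on assumptions: the usual normalization constraint $w_k^{T} S_W w_k = 1$ (with $S_W$ the within-class scatter matrix, neither of which is stated in the text above), under which the stationarity condition of the Lagrangian becomes the generalized eigenproblem $S_B w_k = \lambda S_W w_k$. The function name `fisher_directions`, the `ridge` term, and the synthetic data are all hypothetical and serve only as illustration.

```python
import numpy as np

def fisher_directions(X, y, n_components=1, ridge=1e-6):
    """Leading Fisher discriminant directions for samples X (n x d), labels y.

    Assumes the constraint w^T S_W w = 1 (an assumption, not from the source).
    A small ridge is added to S_W so that it is invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_B = np.zeros((d, d))  # between-class scatter
    S_W = np.zeros((d, d))  # within-class scatter
    for label in np.unique(y):
        Xc = X[y == label]
        mean_c = Xc.mean(axis=0)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += Xc.shape[0] * (diff @ diff.T)
        centered = Xc - mean_c
        S_W += centered.T @ centered
    S_W += ridge * np.eye(d)

    # Stationarity of the Lagrangian: S_B w = lambda * S_W w,
    # solved here as an ordinary eigenproblem of S_W^{-1} S_B.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    W = eigvecs[:, order].real
    # Rescale each column so that w^T S_W w = 1.
    for k in range(W.shape[1]):
        W[:, k] /= np.sqrt(W[:, k] @ S_W @ W[:, k])
    return W, eigvals.real[order], S_B, S_W

# Synthetic two-class data: only feature 0 separates the classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = np.array([0] * 20 + [1] * 20)
X_demo[y_demo == 1, 0] += 6.0

W_demo, lam_demo, S_B_demo, S_W_demo = fisher_directions(X_demo, y_demo)
projection = X_demo @ W_demo[:, 0]  # one-dimensional discriminant scores
```

On real microarray data the number of genes far exceeds the number of samples, so $S_W$ is singular; the ridge term here is only one simple workaround, and a per-gene Fisher ratio would be a cheaper alternative for preliminary filtering.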
Experimental results and analysis
Conclusion