Abstract

In high-dimensional data analysis, Feature Selection (FS) is one of the most fundamental problems in machine learning. High-dimensional datasets have a very large feature space, of which only a few features are significant for analysis, so extracting the significant features is crucial. Among the available feature selection techniques, filter techniques are widely used because they work with any type of learning algorithm, drastically lower the running time of optimization algorithms, and improve model performance. However, the effectiveness of a filter approach depends on the characteristics of the dataset as well as on the machine learning model. To mitigate this dependence, this research considers a combination of feature reduction (CFR), designed as a pipeline of filter approaches, for high-dimensional microarray data classification. From four filter approaches, sixteen pipeline combinations are generated. The feature subset is reduced at successive levels, and the resulting significant feature set is then evaluated. The pipelined filter techniques are Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS), and the classification techniques are Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN). The performance of CFR depends strongly on the datasets as well as on the classifiers. Therefore, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is used to rank all reduction combinations and identify the best-performing filter combination.
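
To make the ranking step concrete, the following is a minimal sketch of standard TOPSIS in plain Python. It is not the authors' implementation: the function name `topsis`, the two criteria (classification accuracy as a benefit, number of retained features as a cost), the equal weights, and every number in the example are assumptions chosen purely for illustration.

```python
import math

def topsis(matrix, weights, benefit):
    """Return a TOPSIS closeness score in [0, 1] for each row (alternative).

    matrix  : list of rows, one per alternative, one column per criterion
    weights : one weight per criterion
    benefit : True if larger is better for that criterion, False if smaller is better
    """
    n = len(weights)
    # Vector-normalise each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]

    # Ideal best and ideal worst value per criterion.
    columns = list(zip(*weighted))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]

    scores = []
    for row in weighted:
        d_best = math.sqrt(sum((v - b) ** 2 for v, b in zip(row, best)))
        d_worst = math.sqrt(sum((v - w) ** 2 for v, w in zip(row, worst)))
        total = d_best + d_worst
        # Closeness to the ideal solution: 1 is best, 0 is worst.
        scores.append(d_worst / total if total else 0.0)
    return scores

# Hypothetical filter pipelines with made-up (accuracy, features kept) values.
labels = ["CBFS->CST", "InG->RFS", "CST->InG"]
matrix = [[0.95, 20], [0.90, 50], [0.85, 80]]
scores = topsis(matrix, weights=[0.5, 0.5], benefit=[True, False])
ranking = [label for _, label in sorted(zip(scores, labels), reverse=True)]
```

Here `ranking` lists the hypothetical pipelines from most to least preferred. In practice the decision matrix would hold one row per reduction combination and one column per evaluation measure, with weights chosen by the analyst.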

Highlights

  • Over the years, researchers have used microarray technology to track gene expression on a genomic scale

  • Motivated by the analyses discussed by several researchers, this paper proposes a pipeline of reduction combinations using filter approaches

  • k-Nearest Neighbor (k-NN) [28,29] chooses the class value of a new instance by examining the k closest instances in the training set, as shown in Equation (6), and selecting the most frequent class value among them, with k set to five and Euclidean distance used to calculate the similarity between two points. It classifies query data according to a similarity measure computed against the stored training data. k-NN parameter tuning is performed to improve performance by selecting an appropriate value of k
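
The k-NN rule described in the last highlight can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's code: the function name `knn_predict`, the two-feature toy samples, and the class labels are all hypothetical, while k = 5 and Euclidean distance follow the text.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Predict the class of `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    # Distance from the query to every training point, paired with its label.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Most frequent class among the neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: made-up expression values for two genes per sample.
train_X = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
           (3.0, 3.2), (3.1, 2.9), (2.8, 3.0), (3.3, 3.1)]
train_y = ["normal", "normal", "normal", "tumor", "tumor", "tumor", "tumor"]

prediction = knn_predict(train_X, train_y, query=(3.0, 3.0), k=5)
```

Tuning k, as the highlight mentions, would amount to calling such a function with several values of k and keeping the one that performs best on held-out data.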


Summary

Introduction

Researchers have used microarray technology to track gene expression on a genomic scale. Examining the expression of genes makes cancer diagnosis and classification possible. The use of microarray technology to analyze gene expression has opened up a world of possibilities for studying cell and organism biology [1]. Researchers focus primarily on the behavior of genes across the experimental conditions studied; recently, biomedical applications have driven both the use of available technologies and the efficient implementation of new analytical tools to deal with these complex data. Microarray data analysis yields useful results that aid in resolving gene expression problems. Cancer categorization is one of the most significant uses of microarray data analysis. It reflects variations in the expression levels of various genes.
