Abstract

High-dimensional biomedical data contain tens of thousands of features, and accurate and effective identification of the core features in these data can assist the diagnosis of related diseases. However, biomedical data often contain a large number of irrelevant or redundant features, which seriously reduce subsequent classification accuracy and machine learning efficiency. To solve this problem, this paper proposes a novel filter feature selection algorithm based on redundant removal (FSBRR) for classifying high-dimensional biomedical data. First, two redundancy criteria are determined by vertical relevance (the relationship between a feature and the class attribute) and horizontal relevance (the relationship between one feature and another). Second, to quantify the redundancy criteria, an approximate redundancy feature framework based on mutual information (MI) is defined to remove redundant and irrelevant features. To evaluate the effectiveness of the proposed algorithm, controlled trials against typical feature selection algorithms are conducted using three different classifiers, and the experimental results indicate that FSBRR can effectively reduce the feature dimension and improve classification accuracy. In addition, an experiment on a small-sample dataset is designed and conducted in the discussion and analysis section to illustrate the specific implementation process of the FSBRR algorithm.
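
The abstract does not give the exact redundancy criteria, so the following is only a minimal sketch of the general idea of an MI-based filter that combines vertical and horizontal relevance. The function names, the `relevance_threshold` parameter, the redundancy rule (a feature is dropped when an already selected feature shares at least as much information with it as it shares with the class) and the toy data are all illustrative assumptions, not the FSBRR definition.

```python
import numpy as np


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete 1-D arrays."""
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    # Joint distribution estimated from co-occurrence counts.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of x, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of y, shape (1, ny)
    nz = joint > 0                         # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def select_features(X, y, relevance_threshold=0.01):
    """Greedy MI filter (illustrative rule, not the published FSBRR criteria)."""
    X = np.asarray(X)
    # Vertical relevance: MI between each feature and the class attribute.
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    )
    # Visit features from most to least relevant; ties keep column order.
    order = np.argsort(-relevance, kind="stable")
    selected = []
    for j in order:
        if relevance[j] <= relevance_threshold:
            continue  # treated as irrelevant to the class
        # Horizontal relevance: MI between this feature and each kept feature.
        redundant = any(
            mutual_information(X[:, j], X[:, k]) >= relevance[j]
            for k in selected
        )
        if not redundant:
            selected.append(int(j))
    return selected


# Toy data (hypothetical): 8 samples, 4 discrete features.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 0: informative about y
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 1: exact copy of feature 0
    [0, 1, 0, 1, 0, 1, 0, 1],  # feature 2: independent of y
    [1, 0, 0, 0, 1, 1, 1, 1],  # feature 3: informative, differs from feature 0
])
selected = select_features(X, y)
print(selected)
```

On this toy data the duplicated column and the class-independent column are discarded, which mirrors the removal of redundant and irrelevant features described above. Continuous measurements such as gene expression values would need to be discretized before this estimator could be applied.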

Highlights

  • The analysis of high-dimensional disease data [1,2] is a very important research field, especially for cancer [3] and mental disorders (e.g. depression [4,5])

  • Among the feature selection algorithm based on redundant removal (FSBRR), Relief, maximum relevance and minimum redundancy (mRmR) and the genetic algorithm (GA), GA is a wrapper feature selection algorithm, so the sizes of its feature subsets differed across the RF, KNN and SVM classifiers

  • In most experiments, RF performed significantly better than KNN and SVM on the same dataset on every index except running time


Summary

Introduction

The analysis of high-dimensional disease data [1,2] is a very important research field, especially for cancer [3] and mental disorders (e.g. depression [4,5]). It is unrealistic to cure these diseases completely, so early diagnosis or prevention plays an important role in their treatment. High-dimensional biomedical data usually contain a large number of weakly relevant or irrelevant features. If all the features are processed, the time complexity, space complexity and accuracy of the prediction can be seriously affected. Feature selection is therefore considered an essential step in diagnosing related diseases from high-dimensional biomedical data.

