Abstract

Selecting relevant features is a common task in most OMICs data analyses, where the aim is to identify a small set of key features to be used as biomarkers. To this end, two alternative but equally valid approaches are mainly available: the univariate (filter) and the multivariate (wrapper) approach. The stability of the selected lists of features is an often neglected but very important requirement. If the same features are selected in multiple independent iterations, they are more likely to be reliable biomarkers. In this study, we developed and evaluated the performance of a novel method for feature selection and prioritization, aiming at generating robust and stable sets of features with high predictive power. The proposed method uses fuzzy logic for a first unbiased feature selection and a Random Forest built from conditional inference trees to prioritize the candidate discriminant features. Analyzing several multi-class gene expression microarray data sets, we demonstrate that our technique provides equal or better classification performance and greater stability than other Random Forest-based feature selection methods.

Highlights

  • Identifying discriminant features, for instance from transcriptomics experiments, and modelling classifiers based on them are fundamental tasks when the aim is to highlight biomarkers

  • Multivariate techniques assess the relevance of groups of features simultaneously, by using selection methods coupled with machine learning techniques such as logistic regression, support vector machines (SVM) or random forests (RF) [7,8,9]

  • We compared the classification performance of the fuzzy pattern – random forest (FPRF).2–250 classifiers with that obtained using the features selected by varSelRF and Boruta

Introduction

Identifying discriminant features, for instance from transcriptomics experiments, and modelling classifiers based on them are fundamental tasks when the aim is to highlight biomarkers (e.g. genes or transcripts discriminating healthy from diseased samples). Clinical classification based on high throughput molecular profiling has already been explored for a number of complex diseases, such as cancer [1,2]. These studies become crucial in terms of public health when such approaches are considered for clinical practice [3]. Multivariate methods tend to identify different subsets of candidate biomarkers with equal accuracy, even when feature selection algorithms are used on the same data [5,6]. This holds for feature selection problems in OMICs data analysis, where the number of investigated features is much larger than the number of samples. Multiple stability issues can affect these data sets, which can also contain a large number of redundant features [10].
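To make the notion of selection stability concrete, the following minimal sketch scores how consistently a selector returns the same features across repeated runs. It uses the mean pairwise Jaccard similarity, which is one common choice and is an assumption here; the text above does not state which stability measure the study used. The function names and the gene lists are hypothetical and serve only as illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of feature names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean pairwise Jaccard similarity over a list of selected-feature lists."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene lists from three independent runs of some selector
# (illustrative values only, not data from the study).
runs = [
    ["TP53", "BRCA1", "EGFR"],
    ["TP53", "BRCA1", "MYC"],
    ["TP53", "BRCA1", "EGFR"],
]
stability = selection_stability(runs)
print(f"stability = {stability:.3f}")
```

A score of 1 means every run selected exactly the same features, while a score near 0 means the runs barely overlap. Two selectors with equal classification accuracy can therefore still differ markedly on this score.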
