Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

Mickael Leclercq, Olivier Perin, Yves Fradet, Arnaud Droit, Alain Bergeron, Benjamin Vittrant, Marie Laure Martin-Magniette, Marie Pier Scott Boyer

doi:10.3389/fgene.2019.00452

Abstract

The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML applies various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software automates all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML.

Highlights

The identification of biomarkers that are indicative of a specific biological state is a major research topic in biomedical applications of computational biology (Liu et al., 2014; Beerenwinkel et al., 2016; Zhang et al., 2017).
Considering the complexity of the machine learning (ML) approach, we present in this paper a software tool called BioDiscML (Biomarker Discovery by Machine Learning), which aims to greatly facilitate biomarker signature identification from high-dimensional data, such as gene expression, by automating the ML approach.
We compared BioDiscML to various recent approaches dedicated to biomarker discovery and modeling, including MINT (Rohart et al., 2017a), AucPR (Yu and Park, 2014), and RGIFE (Swan et al., 2015), to demonstrate the better predictive performance that BioDiscML offers on various omics datasets.

Summary

Introduction

The identification of biomarkers that are indicative of a specific biological state is a major research topic in biomedical applications of computational biology (Liu et al., 2014; Beerenwinkel et al., 2016; Zhang et al., 2017). Research studies involving cohorts of patients aim to discover patterns that establish risk stratification and discriminate between patient states, such as diseased vs. control or different disease types. In recent years, clinical and biological research has turned toward extensive use of OMICs technologies (i.e., proteomics, transcriptomics, metabolomics, genomics, etc.), which include microarrays, mass spectrometry, and whole exome/genome and RNA sequencing. Specific patterns associated with a clinical outcome of interest (e.g., disease diagnosis or prognosis), called biomarker signatures, can be derived from the outputs of these high-dimensional technologies (e.g., gene expression, polymorphisms) (Lin et al., 2017).
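
To make the notion of a biomarker signature more concrete, the following is a minimal Python sketch of how a small subset of features might be selected to predict a binary outcome on an unbalanced dataset. It is not BioDiscML's implementation: the nearest-centroid classifier, the greedy forward search, the leave-one-out cross-validation, the choice of the Matthews correlation coefficient (MCC) as the score, all function names, and the synthetic toy data are assumptions made purely for illustration.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def centroid_predict(train_X, train_y, x, features):
    """Nearest-centroid prediction using only the given feature indices.
    Assumes each class keeps at least one training sample."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(r[f] for r in rows) / len(rows) for f in features]
        dist = sum((x[f] - c) ** 2 for f, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_mcc(X, y, features):
    """Leave-one-out cross-validated MCC for a feature subset."""
    preds = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        preds.append(centroid_predict(train_X, train_y, X[i], features))
    return mcc(y, preds)

def forward_select(X, y, max_features=3):
    """Greedy forward selection: keep adding the feature that most
    improves the leave-one-out MCC, and stop when nothing improves it."""
    selected, best_score = [], -1.0
    candidates = list(range(len(X[0])))
    while candidates and len(selected) < max_features:
        scored = [(loo_mcc(X, y, selected + [f]), f) for f in candidates]
        score, feature = max(scored, key=lambda s: (s[0], -s[1]))
        if score <= best_score:
            break
        selected.append(feature)
        candidates.remove(feature)
        best_score = score
    return selected, best_score

# Synthetic, unbalanced toy data (6 cases, 18 controls).
# Only feature 2 tracks the label; features 0 and 1 are uninformative.
y = [1] * 6 + [0] * 18
X = [[i % 3, (i * 7) % 5, 5.0 * y[i] + 0.3 * (i % 4)] for i in range(24)]
selected, best_score = forward_select(X, y)
print("signature:", selected, "LOO MCC:", best_score)
```

MCC is used here because accuracy alone can look good on unbalanced data when the majority class is always predicted. Note that scoring candidate subsets and reporting performance on the same cross-validation folds gives an optimistic estimate; a separate held-out set would be needed for an unbiased evaluation.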

Methods

Results

Discussion

Conclusion