Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

Elisabetta Manduchi, Stefano Ruberto, Jason H Moore, Joseph D Romano, Weixuan Fu

doi:10.1186/s12859-020-03755-4

Open Access

https://doi.org/10.1186/s12859-020-03755-4

Journal: BMC Bioinformatics
Publication Date: Oct 1, 2020
Citations: 17
License type: open-access

Affiliation: University of Pennsylvania

Abstract

Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.

Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.

Highlights

A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model.
A desirable feature of the Tree-based Pipeline Optimization Tool (TPOT) is the ability to adjust for relevant covariates. This matters in the biomedical context, where baseline characteristics of the subjects or batch effects often influence the target or the features, and that influence needs to be removed so as to isolate the actual effects of the features on the target.
TG-GATEs: To fully exploit this large expression data set to identify pathways and genes directly associated with creatinine levels, we needed to factor out the confounding effect of compound treatment.

Summary

Introduction

A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. In biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. The Tree-based Pipeline Optimization Tool (TPOT) [1, 2] is a genetic programming (GP) based AutoML system that has been successfully used in biomedical applications. The ability to adjust for relevant covariates is therefore a desirable feature of TPOT: the influence of baseline characteristics or batch effects on the target or the features needs to be removed so as to isolate the actual effects of the features on the target. It is important to note that, while common in biostatistics and epidemiology, covariate adjustment is uncommon and understudied in machine learning.
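The abstract describes the approach as regressing out the covariates in a way that avoids leakage during cross-validation. The sketch below illustrates that general idea only and is not the TPOT extension's code. It assumes a plain linear least-squares adjustment, and the function names (fit_adjuster, apply_adjuster, residualize_per_fold) and the synthetic data are hypothetical. The key point is that the adjustment coefficients are estimated on the training rows of each fold and then applied unchanged to the held-out rows.

```python
import numpy as np


def fit_adjuster(covariates_train, values_train):
    """Least-squares fit of values on covariates (with intercept), training rows only."""
    design = np.column_stack([np.ones(len(covariates_train)), covariates_train])
    coef, *_ = np.linalg.lstsq(design, values_train, rcond=None)
    return coef


def apply_adjuster(coef, covariates, values):
    """Subtract the covariate-predicted part of values, leaving residuals."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    return values - design @ coef


def residualize_per_fold(covariates, values, n_folds=5, seed=0):
    """Yield (train_idx, test_idx, train_residuals, test_residuals) for each fold.

    The adjuster never sees the held-out rows when it is fitted, so no
    information leaks from the test fold into the adjustment.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(values)), n_folds)
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        coef = fit_adjuster(covariates[train_idx], values[train_idx])
        yield (
            train_idx,
            test_idx,
            apply_adjuster(coef, covariates[train_idx], values[train_idx]),
            apply_adjuster(coef, covariates[test_idx], values[test_idx]),
        )


# Synthetic illustration (assumed data): 3 features driven partly by 2 covariates.
rng = np.random.default_rng(42)
covariates = rng.normal(size=(200, 2))
effects = np.array([[2.0, 0.0, -1.0], [0.5, 1.0, 0.0]])
features = covariates @ effects + rng.normal(size=(200, 3))

results = list(residualize_per_fold(covariates, features))
```

The same two helper functions can be applied to a target vector instead of a feature matrix, which corresponds to adjusting the target rather than the features. Fitting the adjustment once on the full data set before splitting would be simpler, but it would let held-out samples influence the coefficients, which is the leakage the abstract refers to.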
