Abstract

Machine Learning (ML) approaches are increasingly being used in biomedical applications. Important challenges of ML include choosing the right algorithm and tuning the parameters for optimal performance. Automated ML (AutoML) methods, such as Tree-based Pipeline Optimization Tool (TPOT), have been developed to take some of the guesswork out of ML, thus making this technology available to users from more diverse backgrounds. The goals of this study were to assess the applicability of TPOT to genomics and to identify combinations of single nucleotide polymorphisms (SNPs) associated with coronary artery disease (CAD), with a focus on genes with a high likelihood of being good CAD drug targets. We leveraged public functional genomic resources to group SNPs into biologically meaningful sets to be selected by TPOT. We applied this strategy to data from the UK Biobank, detecting a strikingly recurrent signal stemming from a group of 28 SNPs. Importance analysis of these SNPs uncovered functional relevance of the top SNPs to genes whose association with CAD is supported in the literature and other resources. Furthermore, we employed game-theory-based metrics to study SNP contributions to individual-level TPOT predictions and to discover distinct clusters of well-predicted CAD cases. The latter indicates a promising approach towards precision medicine.
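
To make the idea of game-theory-based, individual-level SNP contributions concrete, the following is a minimal sketch that computes exact Shapley values for one individual. The abstract does not name the specific metric, so the use of Shapley values is an assumption here, and `toy_model`, the genotype vectors, and the all-zero reference are hypothetical stand-ins rather than the study's TPOT pipeline or data.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for a single individual's prediction.

    predict  : function mapping a list of genotypes to a score
    x        : the individual's genotypes (0/1/2 minor-allele counts)
    baseline : reference genotypes used for SNPs outside a coalition
    """
    n = len(x)

    def value(coalition):
        # SNPs in the coalition keep the individual's genotype;
        # all other SNPs are set to the reference genotype.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model: main effects for SNP 0 and SNP 2,
# plus an interaction between SNP 0 and SNP 1.
def toy_model(g):
    return 0.5 * g[0] + 0.2 * g[2] + 0.3 * g[0] * g[1]


individual = [1, 2, 0]   # hypothetical genotypes
reference = [0, 0, 0]    # hypothetical reference individual
phi = shapley_values(toy_model, individual, reference)
```

The contributions sum to the difference between the individual's prediction and the reference prediction, and the interaction term is shared between the two SNPs involved. Exact enumeration scales exponentially with the number of SNPs, so approximations are normally needed at realistic scale.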

Highlights

  • In recent years, Machine Learning (ML) has gained increased appreciation as an alternative or complementary methodology to statistical approaches in ‘omics’ data [...] prostate cancer aggressiveness as the endpoint discovered several feature combinations that significantly contributed to the classification accuracy [5]

  • From the UK Biobank (UKB) data, we extracted all subjects of white British ancestry and retained a maximal subset of unrelated individuals whose genetically inferred sex matched the sex information collected at recruitment

  • Our aim was to explore with Tree-based Pipeline Optimization Tool (TPOT) potentially interesting coronary artery disease (CAD) associations in functionally derived groups of single nucleotide polymorphisms (SNPs), beyond those within the known strongest main effect loci


Summary

Introduction

Machine Learning (ML) has gained increased appreciation as an alternative or complementary methodology to statistical approaches in ‘omics’ data [...] Of particular appeal are Automated ML (AutoML) methods, which assist (potentially non-expert) users in the design and optimization of ML pipelines [4]. [...] developed a genetic programming-(GP-)based AutoML named Tree-based Pipeline Optimization Tool (TPOT) [5], [6], which has been successfully used to analyze data from metabolomics [7], [8], transcriptomics [9], [10], and toxicogenomics [10].

[...] prostate cancer aggressiveness as the endpoint discovered several feature combinations that significantly contributed to the classification accuracy [5]. [...] the endpoint of interest were used to reduce the number of features to the manageable size of ~200 Single Nucleotide Polymorphisms (SNPs). Even with this biological [...] The predictive performance was much lower than that achieved in the other TPOT ‘omics’ applications cited above. This is not surprising considering the challenges associated with complex trait GWAS data, such as missing heritability, typically small effect sizes of [...]
