Genome-wide prediction using Bayesian additive regression trees

Patrik Waldmann

doi:10.1186/s12711-016-0219-8

Abstract

BackgroundThe goal of genome-wide prediction (GWP) is to predict phenotypes based on marker genotypes, often obtained through single nucleotide polymorphism (SNP) chips. The major problem with GWP is high-dimensional data from many thousands of SNPs scored on several thousands of individuals. A large number of methods have been developed for GWP, which are mostly parametric methods that assume statistical linearity and only additive genetic effects. The Bayesian additive regression trees (BART) method was recently proposed and is based on the sum of nonparametric regression trees with the priors being used to regularize the parameters. Each regression tree is based on a recursive binary partitioning of the predictor space that approximates an unknown function, which will automatically model nonlinearities within SNPs (dominance) and interactions between SNPs (epistasis). In this study, we introduced BART and compared its predictive performance with that of the LASSO, Bayesian LASSO (BLASSO), genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) regression and random forest (RF) methods.ResultsTests on the QTLMAS2010 simulated data, which are mainly based on additive genetic effects, show that cross-validated optimization of BART provides a smaller prediction error than the RF, BLASSO, GBLUP and RKHS methods, and is almost as accurate as the LASSO method. If dominance and epistasis effects are added to the QTLMAS2010 data, the accuracy of BART relative to the other methods was increased. We also showed that BART can produce importance measures on the SNPs through variable inclusion proportions. In evaluations using real data on pigs, the prediction error was smaller with BART than with the other methods.ConclusionsBART was shown to be an accurate method for GWP, in which the regression trees guarantee a very sparse representation of additive and complex non-additive genetic effects. Moreover, the Markov chain Monte Carlo algorithm with Bayesian back-fitting provides a computationally efficient procedure that is suitable for high-dimensional genomic data.Electronic supplementary materialThe online version of this article (doi:10.1186/s12711-016-0219-8) contains supplementary material, which is available to authorized users.

Highlights

The goal of genome-wide prediction (GWP) is to predict phenotypes based on marker genotypes, often obtained through single nucleotide polymorphism (SNP) chips
QTLMAS2010 data For the original QTLMAS2010 dataset, the LASSO with the minimum mean squared error (minMSE) option was found to produce a mean squared prediction error (MSPE) of 62.020, which was the lowest value of all six methods (Table 1)
The analysis of the QTLMAS2010 dataset when dominance and epistatic effects are added resulted in an MSPE of 64.353 for Bayesian additive regression trees (BART) with hyperparameters M = 100, κ = 4 and q = 0.9 (Table 2)

Summary

Introduction

The goal of genome-wide prediction (GWP) is to predict phenotypes based on marker genotypes, often obtained through single nucleotide polymorphism (SNP) chips. Howard et al [7] assessed the performance of ten parametric and four nonparametric methods in terms of accuracy and mean squared error using simulated genetic architectures that consisted of purely additive or two-way epistatic interactions in an F2 population derived from crosses of inbred lines. They found that the parametric methods predicted phenotypic values less accurately when the underlying genetic architecture was entirely based on epistasis, whereas the parametric methods resulted in only slightly better predictions than nonparametric methods when the underlying genetic architecture was additive.

Objectives

Methods

Results

Discussion

Conclusion