Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species.

Laura M Zingaretti,Salvador Alejandro Gezan,Amparo Monfort,Patricio R Muñoz,Miguel Pérez-Enciso,Luis F Osorio,Luis Felipe V Ferrão,Vance M Whitaker

doi:10.3389/fpls.2020.00025

Abstract

Genomic prediction (GP) is the procedure whereby the genetic merits of untested candidates are predicted using genome wide marker information. Although numerous examples of GP exist in plants and animals, applications to polyploid organisms are still scarce, partly due to limited genome resources and the complexity of this system. Deep learning (DL) techniques comprise a heterogeneous collection of machine learning algorithms that have excelled at many prediction tasks. A potential advantage of DL for GP over standard linear model methods is that DL can potentially take into account all genetic interactions, including dominance and epistasis, which are expected to be of special relevance in most polyploids. In this study, we evaluated the predictive accuracy of linear and DL techniques in two important small fruits or berries: strawberry and blueberry. The two datasets contained a total of 1,358 allopolyploid strawberry (2n=8x=112) and 1,802 autopolyploid blueberry (2n=4x=48) individuals, genotyped for 9,908 and 73,045 single nucleotide polymorphism (SNP) markers, respectively, and phenotyped for five agronomic traits each. DL depends on numerous parameters that influence performance and optimizing hyperparameter values can be a critical step. Here we show that interactions between hyperparameter combinations should be expected and that the number of convolutional filters and regularization in the first layers can have an important effect on model performance. In terms of genomic prediction, we did not find an advantage of DL over linear model methods, except when the epistasis component was important. Linear Bayesian models were better than convolutional neural networks for the full additive architecture, whereas the opposite was observed under strong epistasis. However, by using a parameterization capable of taking into account these non-linear effects, Bayesian linear models can match or exceed the predictive accuracy of DL. A semiautomatic implementation of the DL pipeline is available at https://github.com/lauzingaretti/deepGP/.

Highlights

Deep learning (DL) techniques comprise a heterogeneous collection of machine learning algorithms which have excelled at many prediction tasks, and this is a very active area of research (Min et al, 2017; Pattanayak, 2017; Namin et al, 2018)
We evaluated the performance of deep learning for genomic prediction in two important horticultural species: allo-octoploid strawberry and auto-tetraploid blueberry
We considered that predicting performance of the last stage lines (T10) is the most interest application for the industry and the population was divided between training (T2, T4, T6, and T8 trials) and validation (T10) subsets with 965 and 393 lines, respectively

Summary

Introduction

Deep learning (DL) techniques comprise a heterogeneous collection of machine learning algorithms which have excelled at many prediction tasks, and this is a very active area of research (Min et al, 2017; Pattanayak, 2017; Namin et al, 2018). Genomic selection (GS) is the breeding strategy consisting in predicting complex traits using genomic-wide genetic markers. Due to the decrease in genotyping costs, genomic selection is becoming the standard tool in many plant and animal breeding programs (Bernardo, 2008; González-Camacho et al, 2012; Crossa et al, 2013b; Meuwissen et al, 2013; Wiggans et al, 2017). There is an increasing number of successful applications of genomic selection in diploid and polyploid organisms where its use has generated important genetic gains by improving the accuracy of breeding value prediction and dramatically reducing generation intervals (Crossa et al, 2013a; Castillo-Juárez et al, 2015; Duangjit et al, 2016; Juliana et al, 2019; de Bem Oliveira et al, 2019)

Methods

Results

Conclusion