Abstract
Most genomic prediction models are linear regression models that assume continuous and normally distributed phenotypes, but responses to diseases such as stripe rust (caused by Puccinia striiformis f. sp. tritici) are commonly recorded on ordinal scales and as percentages. Disease severity (SEV) and infection type (IT) data in germplasm screening nurseries generally do not meet these assumptions. In this regard, researchers may ignore the lack of normality, transform the phenotypes, use generalized linear models, or use supervised learning algorithms and classification models with no restriction on the distribution of response variables, which are less sensitive when modeling ordinal scores. The goal of this research was to compare classification and regression genomic selection models for skewed phenotypes using stripe rust SEV and IT in winter wheat. We extensively compared regression and classification prediction models using two training populations: breeding lines phenotyped in 4 years (2016–2018 and 2020) and a diversity panel phenotyped in 4 years (2013–2016). The prediction models used 19,861 genotyping-by-sequencing single-nucleotide polymorphism markers. Overall, square root transformed phenotypes using ridge regression best linear unbiased prediction and support vector machine regression models displayed the highest combination of accuracy and relative efficiency across the regression and classification models. Furthermore, a classification system based on support vector machine and ordinal Bayesian models with a 2-Class scale for SEV reached the highest class accuracy of 0.99. This study showed that breeders can use linear and non-parametric regression models within their own breeding lines over combined years to accurately predict skewed phenotypes.
Highlights
Genomic selection (GS) is poised to increase genetic gain and reduce cycle time for complex agronomic traits that are difficult to phenotype and analyze (Meuwissen et al., 2001)
The means of the diverse association mapping panel (DP) were higher than those of the breeding line (BL) trials, with lower coefficients of variation (CV)
The varying results for the classification and transformation methods showed the need to choose the prediction model carefully based on the phenotype distribution
Summary
Genomic selection (GS) is poised to increase genetic gain and reduce cycle time for complex agronomic traits that are difficult to phenotype and analyze (Meuwissen et al., 2001). With the advent of high-throughput genotyping, it is feasible to develop and implement GS models for categorical/ordinal phenotypes that are common in most breeding programs and often difficult to analyze. Most GS models are linear regression models that assume continuous and normally distributed phenotypes (Montesinos-López et al., 2015c). When faced with data that do not meet the assumptions of a linear model, researchers have several options. They may ignore the lack of normality, transform the phenotypes, use generalized linear models (GLMs), or use machine learning (ML) algorithms and classification models. Most GS models treat disease resistance as a continuous value and use regression models and transformations for prediction, whereas only a few studies have used classification methods (Ornella et al., 2012; Ornella et al., 2014; Rutkoski et al., 2014; Muleta et al., 2017).
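To make the first two options concrete (ignoring non-normality versus transforming the phenotype), the following is a minimal sketch rather than the study's pipeline. It assumes simulated 0/1/2 marker codes and a simulated right-skewed severity percentage, and it uses plain ridge regression with a fixed, arbitrarily chosen penalty `lam` as a stand-in for ridge regression BLUP, where the penalty would instead be estimated from variance components. The function names (`ridge_fit`, `ridge_predict`, `simulate`, `compare_raw_vs_sqrt`) and all parameter values are hypothetical. Accuracy is taken here as the Pearson correlation between predicted and observed values in a held-out set.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression on column-centered markers."""
    col_means = X.mean(axis=0)
    Xc = X - col_means
    mu = y.mean()
    p = X.shape[1]
    # Solve (Xc'Xc + lam * I) beta = Xc'(y - mu); lam > 0 keeps the system well-posed.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - mu))
    return mu, col_means, beta


def ridge_predict(model, X):
    """Predict phenotypes for new genotypes from a fitted ridge model."""
    mu, col_means, beta = model
    return mu + (X - col_means) @ beta


def simulate(n=300, p=200, seed=0):
    """Simulated data only: marker matrix and a right-skewed severity score."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 allele counts
    effects = rng.normal(0.0, 0.15, size=p)            # assumed marker effects
    latent = X @ effects + rng.normal(0.0, 1.0, size=n)
    z = (latent - latent.mean()) / latent.std()
    sev = np.clip(5.0 * np.exp(z), 0.0, 100.0)         # skewed percentage, 0-100
    return X, sev


def compare_raw_vs_sqrt(lam=100.0, n_test=60, seed=0):
    """Hold-out accuracy (Pearson r) for raw and square-root-transformed SEV."""
    X, sev = simulate(seed=seed)
    X_train, X_test = X[:-n_test], X[-n_test:]
    results = {}
    for name, transform in (("raw", lambda v: v), ("sqrt", np.sqrt)):
        y = transform(sev)
        model = ridge_fit(X_train, y[:-n_test], lam)
        pred = ridge_predict(model, X_test)
        # Each model is scored on the scale it was trained on.
        results[name] = float(np.corrcoef(pred, y[-n_test:])[0, 1])
    return results


if __name__ == "__main__":
    for name, r in compare_raw_vs_sqrt().items():
        print(f"{name}: r = {r:.3f}")
```

Because each model is scored on its own scale, the two correlations are not strictly comparable; a fuller comparison would back-transform predictions or use rank-based metrics, and would repeat the split across cross-validation folds.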