Predicting Phenotypes From High-Dimensional Genomes Using Gradient Boosting Decision Trees

Tingxi Yu, Wuping Zhang, Fuzhong Li, Jiwan Han, Li Wang, Guofang Xing, Chunqing Cao

doi:10.1109/access.2022.3171341

Abstract

Genomic selection (GS) is an emerging technique for predicting unknown phenotypes from genome-wide marker coverage, allowing efficient computational models to select individuals with high phenotypic values as candidate breeding populations. However, GS remains challenging in efficient crop breeding because of the limited size of training populations, the nature of genotype-environment interactions, and the complex interaction patterns between molecular markers. In this study, we use an ensemble learning algorithm to construct gradient boosted decision tree (GBDT) models that predict phenotypic values from genotypic markers. We trained GBDT on the wheat GS dataset and compared its predictive performance with six other widely used GS models. The mean normalized discounted cumulative gain (MNDCG) was used to evaluate each model's ability to select individuals with high phenotypic values. The results show that: (1) Bayesian models converge and reach a steady state only when a sufficient number of iterations is set. As the number of iterations increases, the prediction accuracy of the Bayesian models increases, but their computational efficiency decreases significantly. At 200,000 iterations, the five Bayesian models perform similarly and have converged to a stable state; their prediction accuracy is 7.60% higher than that of the GBDT model overall, while the computational efficiency of the GBDT model is 70 times that of the Bayesian models. (2) The RRBLUP model had the best overall prediction performance, but for some traits the GBDT model was still better than the RRBLUP and Bayesian models at selecting individuals with high phenotypic values. (3) The prediction accuracy of the GBDT and RRBLUP models was influenced by the marker subset: the more markers used, the higher the prediction accuracy, so selecting a marker set of appropriate size can improve the models' prediction performance.
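To make the ranking-based evaluation concrete, the following is a minimal sketch of how a normalized discounted cumulative gain score could be computed for predicted phenotypes. It is an illustration only: the function names, the use of the raw (non-negative) phenotypic value as the gain, the log2 discount, and averaging over a set of cutoffs `ks` are all assumptions, and the exact MNDCG definition used in the study may differ.

```python
import math


def dcg_at_k(values, k):
    """Discounted cumulative gain of the first k values, in the order given.

    Assumption: the phenotypic value itself is the gain, discounted by
    log2(rank + 1) with ranks starting at 1.
    """
    return sum(v / math.log2(i + 2) for i, v in enumerate(values[:k]))


def ndcg_at_k(y_true, y_pred, k):
    """NDCG@k: DCG of the true values ordered by prediction, over the ideal DCG."""
    # Rank individuals from highest to lowest predicted phenotype.
    order = sorted(range(len(y_true)), key=lambda i: y_pred[i], reverse=True)
    ranked_by_prediction = [y_true[i] for i in order]
    # The ideal ranking orders individuals by their true phenotype.
    ideal = sorted(y_true, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg <= 0:
        return 0.0
    return dcg_at_k(ranked_by_prediction, k) / ideal_dcg


def mean_ndcg(y_true, y_pred, ks):
    """Average NDCG@k over several cutoffs (a hypothetical way to form a mean score)."""
    return sum(ndcg_at_k(y_true, y_pred, k) for k in ks) / len(ks)


if __name__ == "__main__":
    # Toy phenotypes for three individuals and two sets of predictions.
    observed = [3.0, 1.0, 2.0]
    good_predictions = [2.9, 0.8, 2.1]   # same ordering as the observed values
    poor_predictions = [1.0, 3.0, 2.0]   # ranks the worst individual first
    print(mean_ndcg(observed, good_predictions, ks=[1, 2, 3]))
    print(mean_ndcg(observed, poor_predictions, ks=[1, 2, 3]))
```

A score of this kind rewards a model for placing the individuals with the highest true phenotypic values near the top of its predicted ranking, which is the selection ability the abstract refers to, rather than for minimizing error across all individuals.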

Full Text