Training Set Design Research Articles

Key message: Historical data from breeding programs can be efficiently used to improve genomic selection accuracy, especially when the training set is optimized to subset individuals most informative of the target testing set.

The current strategy for large-scale implementation of genomic selection (GS) at the International Maize and Wheat Improvement Center (CIMMYT) global maize breeding program has been to train models using information from full-sibs in a “test-half-predict-half approach.” Although effective, this approach has limitations, as it requires large full-sib populations and limits the ability to shorten variety testing and breeding cycle times. The primary objective of this study was to identify optimal experimental and training set designs to maximize prediction accuracy of GS in CIMMYT’s maize breeding programs. Training set (TS) design strategies were evaluated to determine the most efficient use of phenotypic data collected on relatives for genomic prediction (GP) using datasets containing 849 (DS1) and 1389 (DS2) DH-lines evaluated as testcrosses in 2017 and 2018, respectively. Our results show there is merit in the use of multiple bi-parental populations as TS when selected using algorithms to maximize relatedness between the training and prediction sets. In a breeding program where relevant past breeding information is not readily available, the phenotyping expenditure can be spread across connected bi-parental populations by phenotyping only a small number of lines from each population. This significantly improves prediction accuracy compared to within-population prediction, especially when the TS for within full-sib prediction is small. Finally, we demonstrate that prediction accuracy in either sparse testing or “test-half-predict-half” can further be improved by optimizing which lines are planted for phenotyping and which lines are to be only genotyped for advancement based on GP.
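The idea of choosing a training set by relatedness to the prediction set can be illustrated with a minimal sketch. It is not the algorithm used in the study, which the abstract does not specify. It assumes a marker matrix coded numerically (one row per line), a simple centered marker cross-product as the relationship measure, and a greedy ranking by mean relationship to the target lines. The function names `genomic_relationship` and `select_training_set` are hypothetical.

```python
import numpy as np

def genomic_relationship(markers):
    """Illustrative relationship matrix: center each marker column,
    then take the cross-product scaled by the number of markers.
    (An assumption for this sketch, not a specific published estimator.)"""
    centered = markers - markers.mean(axis=0)
    return centered @ centered.T / markers.shape[1]

def select_training_set(markers, target_idx, candidate_idx, size):
    """Rank candidate lines by their mean relationship to the target
    (prediction) lines and return the `size` most related candidates."""
    G = genomic_relationship(markers)
    # Rows: candidates, columns: targets; average over the targets.
    mean_rel = G[np.ix_(candidate_idx, target_idx)].mean(axis=1)
    # Stable sort so ties keep the original candidate order.
    order = np.argsort(-mean_rel, kind="stable")
    return [candidate_idx[i] for i in order[:size]]

if __name__ == "__main__":
    # Toy data (made up): two families with opposite marker profiles.
    family_a = np.tile([1, 1, 1, 1, -1, -1, -1, -1], (4, 1))
    family_b = -family_a
    markers = np.vstack([family_a, family_b]).astype(float)
    chosen = select_training_set(markers, target_idx=[0, 1],
                                 candidate_idx=[2, 3, 4, 5, 6, 7], size=2)
    print(chosen)
```

In this toy case the target lines belong to the first family, so the ranking favors the remaining lines of that family. A real application would likely use a more refined criterion and account for population structure and phenotyping cost.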

Leveraging the depth and breadth of solutions generated through crowdsourcing can be a powerful accelerator to method development for high consequence problems. While data competitions have become quite popular and prevalent, particularly in supervised learning formats, their implementations by the host are highly variable. Without careful planning, a supervised learning competition is vulnerable to overfitting, where the winning solutions are so closely tuned to the particular set of provided data that they cannot generalize to the general underlying problem of interest to the host. This article outlines important considerations for strategically designing relevant and informative data sets to maximize the learning outcome from hosting a competition. These include: 1. precisely defining the scope of the problem, 2. encouraging participation by competitors from diverse technical backgrounds, 3. specifying the most interesting solution space in order to encourage improvement and distinguish between competitors, 4. strategically generating data sets that enable testing for interpolation and extrapolation to new scenarios of interest, 5. leveraging design of experiment principles for strategic data design while preventing unintentional artifacts in the competition data sets that competitors could exploit without addressing the real problem of interest, and 6. carefully designing the leaderboard scoring metric to select top solutions that closely match the overall competition goals. The methods are illustrated with a recently completed competition in the context of urban radiological search to evaluate algorithms capable of detecting, identifying, and locating radioactive materials. Simulated data were used in the urban search competition. Ideas for using measured real data in competitions are also suggested.
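Point 4 in the list above, building data sets that test both interpolation and extrapolation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the competition: each simulated run is a dictionary, one design factor (here the hypothetical key `"source_strength"`) has some levels withheld entirely from training to form the extrapolation test set, and a random fraction of the remaining runs forms the interpolation test set. The function name `split_for_competition` is hypothetical.

```python
import random

def split_for_competition(runs, factor, held_out_levels,
                          interp_fraction=0.2, seed=0):
    """Split runs into training, interpolation-test and extrapolation-test sets.

    runs: list of dicts describing each run (hypothetical structure).
    factor: key of the design factor used to define extrapolation.
    held_out_levels: levels of that factor never shown in training.
    """
    rng = random.Random(seed)
    # Runs at withheld levels test extrapolation to unseen scenarios.
    extrapolation = [r for r in runs if r[factor] in held_out_levels]
    seen = [r for r in runs if r[factor] not in held_out_levels]
    rng.shuffle(seen)
    # A random share of runs at seen levels tests interpolation.
    n_interp = int(round(len(seen) * interp_fraction))
    interpolation = seen[:n_interp]
    training = seen[n_interp:]
    return training, interpolation, extrapolation

if __name__ == "__main__":
    # Made-up runs: four levels of a hypothetical factor, three runs each.
    runs = [{"id": 3 * i + j, "source_strength": level}
            for i, level in enumerate(["low", "medium", "high", "very_high"])
            for j in range(3)]
    train, interp, extrap = split_for_competition(
        runs, "source_strength", {"very_high"}, interp_fraction=1 / 3)
    print(len(train), len(interp), len(extrap))
```

Keeping the withheld levels out of the training data entirely is what lets the extrapolation score say something about new scenarios. Fixing the seed keeps the split reproducible for the host.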

Articles published on Training Set Design

Deep learning insights into non-universality in the halo mass function

Goal‐Oriented Two‐Layered Kernel Models as Automated Surrogates for Surface Kinetics in Reactor Simulations

A kind of numerical model combined with genetic algorithm and back propagation neural network for creep-fatigue life prediction and optimization of double-layered annulus metal hydride reactor and verification of ASME-NH code

Genomic prediction in hybrid breeding: I. Optimizing the training set design

Training set designs for prediction of yield and moisture of maize test cross hybrids with unreplicated trials.

Genomic prediction of tocochromanols in exotic-derived maize.

Sparse kernel models provide optimization of training set design for genomic prediction in multiyear wheat breeding data.

Genomic prediction of hybrid performance: comparison of the efficiency of factorial and tester designs used as training sets in a multiparental connected reciprocal design for maize silage.

LIM Tracker: a software package for cell tracking and analysis with advanced interactivity

Robustness of neural network emulations of radiative transfer parameterizations in a state-of-the-art general circulation model

Informed training set design enables efficient machine learning-assisted directed protein evolution.

Training set design in genomic prediction with multiple biparental families.

CV-α: designing validations sets to increase the precision and enable multiple comparison tests in genomic prediction

Maximizing efficiency of genomic selection in CIMMYT’s tropical maize breeding program

Training set design for machine learning techniques applied to the approximation of computationally intensive first-principles kinetic models

Evaluation of different virtual screening strategies on the basis of compound sets with characteristic core distributions and dissimilarity relationships.

Improved learning from data competitions through strategic design of training and test data sets

Feature parameter extraction and intelligent estimation of the State-of-Health of lithium-ion batteries

The effects of training population design on genomic prediction accuracy in wheat

Quality monitoring in petroleum refinery with regression neural network: Improving prediction accuracy with appropriate design of training set
