Abstract
MotivationGiven a set of biallelic molecular markers, such as SNPs, with genotype values on a collection of plant, animal or human samples, the goal of quantitative genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Quantitative genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes: the three distinct genotype values, corresponding to one heterozygous and two homozygous alleles, are usually coded as integers, and manipulated algebraically in the model. Further, epistasis between multiple markers is modeled as multiplication between the markers: it is unclear that the regression model continues to be effective under this. In this work we investigate the effects of encodings to the quantitative genetic trait prediction problem.ResultsWe first showed that different encodings lead to different prediction accuracies, in many test cases. We then proposed a data-driven encoding strategy, where we encode the genotypes according to their distribution in the phenotypes and we allow each marker to have different encodings. We show in our experiments that this encoding strategy is able to improve the performance of the genetic trait prediction method and it is more helpful for the oligogenic traits, whose values rely on a relatively small set of markers. To the best of our knowledge, this is the first paper that discusses the effects of encodings to the genetic trait prediction problem.
Highlights
Whole genome prediction of complex phenotypic traits using high-density genotyping arrays has attracted a lot of attention, as it is relevant to the fields of plant and animal breeding and genetic epidemiology [1,2,3,4,5,6,7,8]
Each with m ≫ n genotype values and a trait value, and a set of n′ test samples each with the same set of genotype values but without trait value, the task is to train a predictive model from the training samples to predict the trait value, or phenotype of each test sample based on their genotype values
Effects of different encodings We first illustrate that different encodings lead to different prediction performances for epistasis model defined in Formula 2, a data-driven encoding has the benefit that we do not need to worry about the selection of the encodings
Summary
Whole genome prediction of complex phenotypic traits using high-density genotyping arrays has attracted a lot of attention, as it is relevant to the fields of plant and animal breeding and genetic epidemiology [1,2,3,4,5,6,7,8]. The authors present four approaches: least-squares estimation, BayesA, BayesB, and rrBLUP (Ridge-Regression BLUP), based on a linear model of the marker effects on the trait being studied. The latter three methods are still competitive with the state-of-art techniques, and have . The problem is usually represented as the following linear regression model: m
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.