Partially connected neural networks for complex trait prediction: application to human height.
- Research Article
12
- 10.1016/j.xplc.2024.101002
- Jun 13, 2024
- Plant Communications
Dual-extraction modeling: A multi-modal deep-learning architecture for phenotypic prediction and functional gene mining of complex traits
- Research Article
156
- 10.1186/s12711-020-00531-z
- Feb 24, 2020
- Genetics Selection Evolution
Background: Transforming large amounts of genomic data into valuable knowledge for predicting complex traits has been an important challenge for animal and plant breeders. Prediction of complex traits has not escaped the current excitement on machine-learning, including interest in deep learning algorithms such as multilayer perceptrons (MLP) and convolutional neural networks (CNN). The aim of this study was to compare the predictive performance of two deep learning methods (MLP and CNN), two ensemble learning methods [random forests (RF) and gradient boosting (GB)], and two parametric methods [genomic best linear unbiased prediction (GBLUP) and Bayes B] using real and simulated datasets. Methods: The real dataset consisted of 11,790 Holstein bulls with sire conception rate (SCR) records and genotyped for 58k single nucleotide polymorphisms (SNPs). To support the evaluation of deep learning methods, various simulation studies were conducted using the observed genotype data as template, assuming a heritability of 0.30 with either additive or non-additive gene effects, and two different numbers of quantitative trait nucleotides (100 and 1000). Results: In the bull dataset, the best predictive correlation was obtained with GB (0.36), followed by Bayes B (0.34), GBLUP (0.33), RF (0.32), CNN (0.29) and MLP (0.26). The same trend was observed when using mean squared error of prediction. The simulation indicated that when gene action was purely additive, parametric methods outperformed other methods. When the gene action was a combination of additive, dominance and of two-locus epistasis, the best predictive ability was obtained with gradient boosting, and the superiority of deep learning over the parametric methods depended on the number of loci controlling the trait and on sample size. In fact, with a large dataset including 80k individuals, the predictive performance of deep learning methods was similar or slightly better than that of parametric methods for traits with non-additive gene action. Conclusions: For prediction of traits with non-additive gene action, gradient boosting was a robust method. Deep learning approaches were not better for genomic prediction unless non-additive variance was sizable.
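The abstract above ranks methods by predictive correlation and by mean squared error of prediction. A minimal sketch of how these two criteria are commonly computed is given here, assuming the predictive correlation is the Pearson correlation between observed and predicted phenotypes in a test set; the function names and toy values are hypothetical and are not taken from the study.

```python
import math

def predictive_correlation(observed, predicted):
    """Pearson correlation between observed and predicted phenotypes."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Toy test-set values (hypothetical, for illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

r = predictive_correlation(observed, predicted)
mse = mean_squared_error(observed, predicted)
print(round(r, 3), mse)
```

The two criteria need not agree in general: correlation ignores bias and scale in the predictions, whereas mean squared error penalizes both.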
- Research Article
7
- 10.1016/j.plantsci.2021.111153
- Dec 13, 2021
- Plant Science
Accurate prediction of complex traits for individuals and offspring from parents using a simple, rapid, and efficient method for gene-based breeding in cotton and maize
- Dissertation
1
- 10.18174/390205
- May 8, 2019
In this thesis we describe the results of a number of quantitative techniques that were used to understand the genetics of yield in pepper as an example of a complex trait measured in a number of environments. The main objectives were: i) to propose a number of mixed models to detect QTLs for multiple traits and multiple environments, ii) to extend the multi-trait QTL models to a multi-trait genomic prediction model, iii) to study how well the complex trait yield can be indirectly predicted from its component traits, and iv) to understand the ‘causal’ relationships between the target trait yield and its component traits. The thesis is part of an EU-FP7 project “Smart tools for Prediction and Improvements of Crop Yield” (SPICY, http://www.spicyweb.eu/). This project generated phenotypic data from four environments using 149 individuals from the sixth generation of recombinant inbred lines obtained from an intraspecific cross between the large-fruited inbred pepper cultivar ‘Yolo Wonder’ (YW) and the hot pepper cultivar ‘Criollo de Morelos 334’ (CM 334). A total of 16 physiological traits were evaluated across the four trials and various types of genetic parameters were estimated. In a first analysis, the traits were univariately analyzed using a linear mixed model. Trait heritabilities were generally large (ranging from 0.43 to 0.96 with an average of 0.86) and mostly comparable across trials while many of the traits displayed heterosis and transgression. The same QTLs were detected across the four trials, though QTL magnitude differed for many of the traits. We also found that some QTLs affected more than one trait, suggesting QTL pleiotropy (a QTL region affecting more than one trait). We discussed our results in the light of previously reported QTLs for these and similar traits in pepper.
We addressed the presence of genotype-by-environment interaction (GEI) in yield and the other traits through a multi-environment (ME) mixed model methodology with terms for QTL-by-environment interaction (QEI). We opined that yield would benefit from joint analysis with other traits and so deployed two other mixed model based multi-response QTL approaches: a multi-trait approach (MT) and a multi-trait multi-environment approach (MTME). For yield as well as the other traits, MTME was superior to ME and MT in the number of QTLs, the explained variance and accuracy of predictions. Many of the detected QTLs were pleiotropic and showed quantitative QEI. The results confirmed the feasibility and strengths of novel mixed model QTL methodology to study the architecture of complex traits. The QTL methods considered thus far are not well suited for prediction purposes as only a limited set of QTL-related markers are used. Since the main interest of this research includes improvement of yield prediction, we explored both single-trait and multi-trait versions of genomic prediction (GP) models as alternatives to the QTL-based prediction (QP) models. This was termed direct prediction. The methods differed in their predictive accuracies, with GP methods outperforming QP methods in both single-trait and multi-trait situations. We borrowed ideas from crop growth models (CGM) to dissect the complex trait yield into a number of its component traits. Here, we integrated QTL/genomic prediction and CGM approaches and showed that the target trait yield can be predicted via its component traits together with environmental covariables. This was termed indirect prediction. The CGM approach seemed to work well at first sight, but this is especially due to the fact that yield appeared to be strongly driven by just one of its components, the partitioning to fruit. An alternative representation of the biological knowledge of a complex target trait such as yield is provided by network type models.
We constructed both conditional and unconditional networks across the four environments to understand the ‘causal’ relationships between target trait yield and its component traits. The final networks for each environment from both conditional and unconditional methods were used in a structural equation model to assess the causal relationships. Conditioning QTL mapping on network structure improved detection of refined genetic architecture by distinguishing between QTL with direct and indirect effects, thereby removing non-significant effects found in the unconditional network and resolving QTL pleiotropy. Similar to the CGM topology, yield was established to be downstream to its component traits, indicating that yield can be studied and predicted from its component traits. Thus, the genetic improvements of yield would benefit from improvements on the component traits. Finally, complex trait prediction can be enhanced by a full integration of the methods described in the different chapters. Recent research efforts have been channelled to incorporating both multivariate whole genome prediction models and crop growth models. Further research is required, but we hope that the present thesis presents useful steps towards better prediction models for complex traits exhibiting genotype by environment interaction.
- Research Article
31
- 10.1534/genetics.119.302934
- Feb 1, 2020
- Genetics
A multiple-trait Bayesian LASSO (MBL) for genome-based analysis and prediction of quantitative traits is presented and applied to two real data sets. The data-generating model is a multivariate linear Bayesian regression on possibly a huge number of molecular markers, and with a Gaussian residual distribution posed. Each (one per marker) of the vectors of regression coefficients (T: number of traits) is assigned the same T-variate Laplace prior distribution, with a null mean vector and unknown scale matrix Σ. The multivariate prior reduces to that of the standard univariate Bayesian LASSO when T = 1. The covariance matrix of the residual distribution is assigned a multivariate Jeffreys prior, and Σ is given an inverse-Wishart prior. The unknown quantities in the model are learned using a Markov chain Monte Carlo sampling scheme constructed using a scale-mixture of normal distributions representation. MBL is demonstrated in a bivariate context employing two publicly available data sets using a bivariate genomic best linear unbiased prediction model (GBLUP) for benchmarking results. The first data set is one where wheat grain yields in two different environments are treated as distinct traits. The second data set comes from genotyped Pinus trees, with each individual measured for two traits: rust bin and gall volume. In MBL, the bivariate marker effects are shrunk differentially, i.e., “short” vectors are more strongly shrunk toward the origin than in GBLUP; conversely, “long” vectors are shrunk less. A predictive comparison was carried out as well in wheat, where the comparators of MBL were bivariate GBLUP and bivariate Bayes Cπ, a variable selection procedure. A training-testing layout was used, with 100 random reconstructions of training and testing sets. For the wheat data, all methods produced similar predictions. In Pinus, MBL gave better predictions than either a Bayesian bivariate GBLUP or the single trait Bayesian LASSO.
MBL has been implemented in the Julia language package JWAS, and is now available for the scientific community to explore with different traits, species, and environments. It is well known that there is no universally best prediction machine, and MBL represents a new resource in the armamentarium for genome-enabled analysis and prediction of complex traits.
- Research Article
69
- 10.1534/genetics.115.177204
- Aug 6, 2015
- Genetics
Prediction of complex traits using molecular genetic information is an active area in quantitative genetics research. In the postgenomic era, many types of -omic (e.g., transcriptomic, epigenomic, methylomic, and proteomic) data are becoming increasingly available. Therefore, evaluating the utility of this massive amount of information in prediction of complex traits is of interest. DNA methylation, the covalent change of a DNA molecule without affecting its underlying sequence, is one quantifiable form of epigenetic modification. We used methylation information for predicting plant height (PH) in Arabidopsis thaliana nonparametrically, using reproducing kernel Hilbert spaces (RKHS) regression. Also, we used different criteria for selecting smaller sets of probes, to assess how representative probes could be used in prediction instead of using all probes, which may lessen computational burden and lower experimental costs. Methylation information was used for describing epigenetic similarities between individuals through a kernel matrix, and the performance of predicting PH using this similarity matrix was reasonably good. The predictive correlation reached 0.53 and the same value was attained when only preselected probes were used for prediction. We created a kernel that mimics the genomic relationship matrix in genomic best linear unbiased prediction (G-BLUP) and estimated that, in this particular data set, epigenetic variation accounted for 65% of the phenotypic variance. Our results suggest that methylation information can be useful in whole-genome prediction of complex traits and that it may help to enhance understanding of complex traits when epigenetics is under examination.
- Research Article
147
- 10.1371/journal.pgen.1004754
- Nov 13, 2014
- PLoS Genetics
Regularized machine learning in the genetic prediction of complex traits.
- Supplementary Content
3
- 10.3390/genes14081630
- Aug 16, 2023
- Genes
A high number of genome variants are associated with complex traits, mainly due to genome-wide association studies (GWAS). Using polygenic risk scores (PRSs) is a widely accepted method for calculating an individual’s complex trait prognosis using such data. Unlike monogenic traits, the practical implementation of complex traits by applying this method still falls behind. Calculating PRSs from all GWAS data has limited practical usability in behaviour traits due to statistical noise and the small effect size from a high number of genome variants involved. From a behaviour traits perspective, complex traits are explored using the concept of core genes from an omnigenic model, aiming to employ a simplified calculation version. Simplification may reduce the accuracy compared to a complete PRS encompassing all trait-associated variants. Integrating genome data with datasets from various disciplines, such as IT and psychology, could lead to better complex trait prediction. This review elucidates the significance of clear biological pathways in understanding behaviour traits. Specifically, it highlights the essential role of genes related to hormones, enzymes, and neurotransmitters as robust core genes in shaping these traits. Significant variations in core genes are prominently observed in behaviour traits such as stress response, impulsivity, and substance use.
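The review above contrasts a complete polygenic risk score with a simplified score built from a subset of core genes. A minimal sketch of both calculations follows, assuming the usual additive form in which the score is the sum of effect-allele dosages (0, 1 or 2) weighted by per-variant effect sizes; the function name, effect sizes, dosages and the choice of "core" variants are all hypothetical.

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of effect-allele dosages over a set of variants."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))

# Hypothetical per-variant effect sizes and one individual's dosages.
effects = [0.10, -0.05, 0.20]
person = [2, 1, 0]

# Score over all variants.
full_score = polygenic_score(person, effects)

# Simplified score restricted to a hypothetical set of core variants.
core_idx = [0, 2]
core_score = polygenic_score(
    [person[i] for i in core_idx],
    [effects[i] for i in core_idx],
)
print(full_score, core_score)
```

Restricting the sum to a subset drops the contribution of every excluded variant, which is the accuracy trade-off the abstract describes.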
- Research Article
242
- 10.1371/journal.pgen.1002051
- Apr 28, 2011
- PLoS Genetics
Despite rapid advances in genomic technology, our ability to account for phenotypic variation using genetic information remains limited for many traits. This has unfortunately resulted in limited application of genetic data towards preventive and personalized medicine, one of the primary impetuses of genome-wide association studies. Recently, a large proportion of the “missing heritability” for human height was statistically explained by modeling thousands of single nucleotide polymorphisms concurrently. However, it is currently unclear how gains in explained genetic variance will translate to the prediction of yet-to-be observed phenotypes. Using data from the Framingham Heart Study, we explore the genomic prediction of human height in training and validation samples while varying the statistical approach used, the number of SNPs included in the model, the validation scheme, and the number of subjects used to train the model. In our training datasets, we are able to explain a large proportion of the variation in height (h² up to 0.83, R² up to 0.96). However, the proportion of variance accounted for in validation samples is much smaller (ranging from 0.15 to 0.36 depending on the degree of familial information used in the training dataset). While such R² values vastly exceed what has been previously reported using a reduced number of pre-selected markers (<0.10), given the heritability of the trait (∼0.80), substantial room for improvement remains.
- Research Article
1
- 10.5808/gi.2010.8.3.142
- Sep 30, 2010
- Genomics & Informatics
In recent years, genome-wide association (GWA) studies have successfully led to many discoveries of genetic variants affecting common complex traits, including height, blood pressure, and diabetes. Although GWA studies have made much progress in finding single nucleotide polymorphisms (SNPs) associated with many complex traits, such SNPs have been shown to explain only a very small proportion of the underlying genetic variance of complex traits. This is partly due to the fact that most current GWA studies have relied on single-marker approaches that identify single genetic factors individually and have limitations in considering the joint effects of multiple genetic factors on complex traits. Joint identification of multiple genetic factors would be more powerful and provide a better prediction of complex traits, since it utilizes combined information across variants. Recently, a new statistical method for joint identification of genetic variants for common complex traits via the elastic-net regularization method was proposed. In this study, we applied this joint identification approach to a large-scale GWA dataset (i.e., 8842 samples and 327,872 SNPs) in order to identify genetic variants of obesity for the Korean population. In addition, in order to test for the biological significance of the jointly identified SNPs, gene ontology and pathway enrichment analyses were further conducted.
- Research Article
6
- 10.3389/fpls.2022.800161
- Apr 29, 2022
- Frontiers in Plant Science
Prediction of complex traits based on genome-wide marker information is of central importance for both animal and plant breeding. Numerous models have been proposed for the prediction of complex traits, and considerable effort is still being devoted to improving their prediction accuracy, because various genetic factors such as additive, dominance and epistasis effects can influence the prediction accuracy of such models. Recently, machine learning (ML) methods have been widely applied for prediction in both animal and plant breeding programs. In this study, we propose a new algorithm for genomic prediction which is based on neural networks, but incorporates classical elements of LASSO. Our new method is able to account for the local epistasis (higher order interaction between the neighboring markers) in the prediction. We compare the prediction accuracy of our new method with the most commonly used prediction methods, such as BayesA, BayesB, Bayesian Lasso (BL), genomic BLUP and Elastic Net (EN) using the heterogeneous stock mouse and rice field data sets.
- Research Article
41
- 10.1186/1471-2164-15-109
- Jan 1, 2014
- BMC Genomics
Background: Genome-wide association studies have been deemed successful for identifying statistically associated genetic variants of large effects on complex traits. Past studies have found enrichment of trait-associated SNPs in functionally annotated regions, while depletion was reported for intergenic regions (IGR). However, no systematic examination of connections between genomic regions and predictive ability of complex phenotypes has been carried out. Results: In this study, we partitioned SNPs based on their annotation to characterize genomic regions that deliver low and high predictive power for three broiler traits in chickens using a whole-genome approach. Additive genomic relationship kernels were constructed for each of the genic regions considered, and a kernel-based Bayesian ridge regression was employed as prediction machine. We found that the predictive performance for ultrasound area of breast meat from using genic regions marked by SNPs was consistently better than that from SNPs in IGR, while IGR tagged by SNPs were better than the genic regions for body weight and hen house egg production. We also noted that predictive ability delivered by the whole battery of markers was close to the best prediction achieved by one of the genomic regions. Conclusions: Whole-genome regression methods use all available quality filtered SNPs into a model, contrary to accommodating only validated SNPs from exonic or coding regions. Our results suggest that, while differences among genomic regions in terms of predictive ability were observed, the whole-genome approach remains as a promising tool if interest is on prediction of complex traits.
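The abstract above builds an additive genomic relationship kernel for each genomic region. The exact construction is not stated there, so the sketch below assumes one widely used form (often attributed to VanRaden): genotypes coded 0/1/2 are centered by twice the allele frequency and the cross-product is scaled by 2Σp(1−p). The function name and the toy genotypes are hypothetical.

```python
def genomic_relationship_matrix(genotypes):
    """Additive relationship matrix from 0/1/2 allele counts.

    genotypes: list of individuals, each a list of allele counts per marker.
    Assumes allele frequencies are estimated from the same sample.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Allele frequency of the counted allele at each marker.
    freqs = [sum(ind[j] for ind in genotypes) / (2 * n) for j in range(m)]
    # Center each genotype by its expected value, 2p.
    z = [[ind[j] - 2 * freqs[j] for j in range(m)] for ind in genotypes]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic; matrix is undefined")
    return [
        [sum(a * b for a, b in zip(z[i], z[k])) / scale for k in range(n)]
        for i in range(n)
    ]

# Toy genotypes for three individuals and three markers (hypothetical).
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
]
G = genomic_relationship_matrix(genotypes)
for row in G:
    print([round(v, 3) for v in row])
```

A region-specific kernel of the kind described could be obtained by passing only the marker columns annotated to that region.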
- Research Article
5
- 10.3390/plants11172190
- Aug 24, 2022
- Plants
Whole-genome multi-omics profiles contain valuable information for the characterization and prediction of complex traits in plants. In this study, we evaluate multi-omics models to predict four complex traits in barley (Hordeum vulgare): grain yield, thousand kernel weight, protein content, and nitrogen uptake. Genomic, transcriptomic, and DNA methylation data were obtained from 75 spring barley lines tested in the RadiMax semi-field phenomics facility under control and water-scarce treatment. By integrating multi-omics data at genomic, transcriptomic, and DNA methylation regulatory levels, a higher proportion of phenotypic variance was explained (0.72–0.91) than with genomic models alone (0.55–0.86). The correlation between predictions and phenotypes varied from 0.17–0.28 for control plants and 0.23–0.37 for water-scarce plants, and the increase in accuracy was significant for nitrogen uptake and protein content compared to models using genomic information alone. Adding transcriptomic and DNA methylation information to the prediction models explained more of the phenotypic variance attributed to the environment in grain yield and nitrogen uptake. It furthermore explained more of the non-additive genetic effects for thousand kernel weight and protein content. Our results show the feasibility of multi-omics prediction for complex traits in barley.
- Research Article
48
- 10.1002/gepi.21808
- May 2, 2014
- Genetic Epidemiology
High-confidence prediction of complex traits such as disease risk or drug response is an ultimate goal of personalized medicine. Although genome-wide association studies have discovered thousands of well-replicated polymorphisms associated with a broad spectrum of complex traits, the combined predictive power of these associations for any given trait is generally too low to be of clinical relevance. We propose a novel systems approach to complex trait prediction, which leverages and integrates similarity in genetic, transcriptomic, or other omics-level data. We translate the omic similarity into phenotypic similarity using a method called Kriging, commonly used in geostatistics and machine learning. Our method called OmicKriging emphasizes the use of a wide variety of systems-level data, such as those increasingly made available by comprehensive surveys of the genome, transcriptome, and epigenome, for complex trait prediction. Furthermore, our OmicKriging framework allows easy integration of prior information on the function of subsets of omics-level data from heterogeneous sources without the sometimes heavy computational burden of Bayesian approaches. Using seven disease datasets from the Wellcome Trust Case Control Consortium (WTCCC), we show that OmicKriging allows simple integration of sparse and highly polygenic components yielding comparable performance at a fraction of the computing time of a recently published Bayesian sparse linear mixed model method. Using a cellular growth phenotype, we show that integrating mRNA and microRNA expression data substantially increases performance over either dataset alone. Using clinical statin response, we show improved prediction over existing methods. We provide an R package to implement OmicKriging (http://www.scandb.org/newinterface/tools/OmicKriging.html).
- Research Article
75
- 10.1186/s12711-015-0097-5
- Mar 31, 2015
- Genetics, Selection, Evolution : GSE
Background: Recently, artificial neural networks (ANN) have been proposed as promising machines for marker-based genomic predictions of complex traits in animal and plant breeding. ANN are universal approximators of complex functions, that can capture cryptic relationships between SNPs (single nucleotide polymorphisms) and phenotypic values without the need of explicitly defining a genetic model. This concept is attractive for high-dimensional and noisy data, especially when the genetic architecture of the trait is unknown. However, the properties of ANN for the prediction of future outcomes of genomic selection using real data are not well characterized and, due to high computational costs, using whole-genome marker sets is difficult. We examined different non-linear network architectures, as well as several genomic covariate structures as network inputs in order to assess their ability to predict milk traits in three dairy cattle data sets using large-scale SNP data. For training, a regularized back propagation algorithm was used. The average correlation between the observed and predicted phenotypes in a 20 times 5-fold cross-validation was used to assess predictive ability. A linear network model served as benchmark. Results: Predictive abilities of different ANN models varied markedly, whereas differences between data sets were small. Dimension reduction methods enhanced prediction performance in all data sets, while at the same time computational cost decreased. For the Holstein-Friesian bull data set, an ANN with 10 neurons in the hidden layer achieved a predictive correlation of r=0.47 for milk yield when the entire marker matrix was used. Predictive ability increased when the genomic relationship matrix (r=0.64) was used as input and was best (r=0.67) when principal component scores of the marker genotypes were used. Similar results were found for the other traits in all data sets. Conclusion: Artificial neural networks are powerful machines for non-linear genome-enabled predictions in animal breeding. However, to produce stable and high-quality outputs, variable selection methods are highly recommended, when the number of markers vastly exceeds sample size.
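The abstract above assesses predictive ability with a 20 times 5-fold cross-validation. A minimal sketch of how such repeated splits can be generated follows, assuming each repeat reshuffles the sample indices and partitions them into five disjoint test folds; the function name, the seed and the sample size of 23 are hypothetical.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=20, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx

# 20 repeats x 5 folds gives 100 train/test splits.
splits = list(repeated_kfold(n_samples=23))
print(len(splits))
```

In such a design, a model would be fitted on each training set, the correlation between observed and predicted phenotypes computed on the matching test set, and the correlations averaged over all splits.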