Abstract
Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS – the ability to detect genetic association by linkage disequilibrium (LD) – is also its limitation. Whilst ever-increasing study sizes and improved design have augmented the power of GWAS to detect effects, differentiation of causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered the biological insights and clinical translation of GWAS findings. Although thousands of disease susceptibility loci have been reported, causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in their complexity, ranging from relatively simple logistic regression approaches to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models, i.e., neural networks. Paired with functional validation, these methods show considerable promise for clinical translation, providing a strong evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review investigates the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on the prioritization of complex disease-associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game with consequent wide-ranging translational impact.
Highlights
A genome-wide association study (GWAS) examines a genome-wide set of genetic variants in a group of individuals to identify variants associated with a trait or phenotype
This review focuses on machine learning (ML) methodologies for post-GWAS prioritization, from model selection and input features to performance assessment and output prioritization results
How data are collected and recorded affects the reliability of ML methods and the comparison of model performance. This applies to models such as ExPecto and iMEGES, which first apply variant prediction and then feed it into gene prioritization as a feature (Khan et al, 2018; Zhou et al, 2018): the predicted features risk overfitting and may not be reproducible
Summary
A genome-wide association study (GWAS) examines a genome-wide set of genetic variants in a group of individuals to identify variants associated with a trait or phenotype. As GWAS have scaled up to discover ever more disease variants (Evangelou et al, 2018; Giri et al, 2019; Nalls et al, 2019), it has become impractical to perform functional investigation on all disease-relevant loci. This limitation arises in part from variability in the reporting of GWAS results: some studies report loci that have been independently replicated in a different cohort (the gold standard approach), and others do not. A compounding factor is the need to differentiate causal variants or genes from other genes associated by linkage disequilibrium (LD), which confounds the detection of causal genes within a locus and makes it unclear which variants and genes warrant further analysis and potential functional study. This range of issues undermines the robustness of GWAS and challenges the validity of downstream analyses and biological hypothesis development, critically weakening some of the major motivations for performing GWAS in the first place, such as target validation (Hurle et al, 2016). This highlights the need for computational solutions to improve the signal-to-noise ratio of GWAS results and to highlight the genes and variants that are most likely to be causal.