Abstract

A polygenic risk score estimates the genetic risk of an individual for some disease or trait, calculated by aggregating the effect of many common variants associated with the condition. With the increasing availability of genetic data in large cohort studies such as the UK Biobank, inclusion of this genetic risk as a covariate in statistical analyses is becoming more widespread. Previously this required specialist knowledge, but as tooling and data availability have improved it has become more feasible for statisticians and epidemiologists to calculate existing scores themselves for use in analyses. While tutorial resources exist for conducting genome-wide association studies and generating new polygenic risk scores, fewer guides exist for the simple calculation and application of existing genetic scores. This guide outlines the key steps of this process: selection of suitable polygenic risk scores from the literature, extraction of relevant genetic variants and verification of their quality, calculation of the risk score, and key considerations for its inclusion in statistical models, using the UK Biobank imputed data as a model dataset. Many of the techniques in this guide will generalize to other datasets; however, we also focus on some of the specific techniques required for using data in the formats UK Biobank have selected. This includes some of the challenges faced when working with large numbers of variants, where the computation time required by some tools is impractical. We have focused on only a couple of tools, which may not be the best ones for every aspect of the process. However, one barrier to working with genetic data is the sheer volume of tools available and the difficulty for a novice of assessing their viability. By discussing in depth a couple of tools that are adequate for the calculation even at large scale, we hope to make polygenic risk scores more accessible to a wider range of researchers.

Highlights

  • A polygenic risk score (PRS), sometimes called polygenic score (PGS) or genetic risk score (GRS), is an estimate of an individual’s genetic risk for some trait, obtained by aggregating and quantifying the effect of many common variants in the genome, each of which can have a small effect on a person’s genetic risk for a given disease or condition

  • Dedicated PRS tools like PRSice-2 (Choi and O’Reilly, 2019) can be used, but these were designed for those wishing to develop a new PRS from scratch, offering more complex functionality and assuming a level of domain expertise that may be off-putting for a beginner or casual user

  • We chose the PRS for low-density lipoprotein cholesterol developed by Klarin et al. (2018) in the Million Veteran Program data, because it is a relatively recent PRS that provides a comprehensive selection of single nucleotide polymorphisms (SNPs) in the context of the current literature


Summary

Introduction

A polygenic risk score (PRS), sometimes called polygenic score (PGS) or genetic risk score (GRS), is an estimate of an individual’s genetic risk for some trait, obtained by aggregating and quantifying the effect of many common variants (usually defined as minor allele frequency ≥1%) in the genome, each of which can have a small effect on a person’s genetic risk for a given disease or condition.

  • Genotyping: the genotyped locations contain known variants of interest, so genotyping is good at identifying which known variants a person has, but not at finding new variants

  • Genotype imputation: uses a reference panel to estimate, by statistical inference, genotypes at locations that were not directly called

  • Heritability: the amount of observable (phenotypic) variation among individuals of a population that is due to genetic variation between the individuals

  • Linkage disequilibrium (LD): a measure of the correlation between neighbouring genetic variants that are more likely to be inherited together because of their physical proximity, leading to association within a population

  • Locus: the physical location of a gene or DNA polymorphism on a chromosome (plural “loci”)

  • Multi-allelic: when there is more than one possible variant nucleotide (in addition to the reference) at a location, we say this location is “multi-allelic”

  • Sequencing: enables the exact sequence of bases in a length of DNA to be determined. This technique can be used on targeted areas such as the exome, and it is becoming increasingly cost-effective to do whole genome sequencing

  • Phenotype: the phenotype of an organism is its observable characteristics, for example its physical appearance

  • rsID: the rsID for a SNP is the unique RefSNP ID number identifying the “reference SNP cluster” containing this SNP in dbSNP

These analyses may validate the association between the PRS and the trait of interest.

The resulting score is approximately normally distributed in the general population, with higher scores indicating higher risk (Figure 1).
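As a minimal sketch of the aggregation described above, the Python example below computes a score as a weighted sum of effect-allele dosages and then standardizes it across a sample. The weighted-sum form (per-allele effect size multiplied by the number of copies of the effect allele) is the common convention and is assumed here rather than taken from this text. The function names `polygenic_score` and `standardize`, the three variants, and all numeric values are hypothetical and purely illustrative.

```python
from statistics import mean, stdev


def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: copies of the effect allele at each variant (0 to 2;
             imputed data may give fractional values).
    weights: per-allele effect sizes, in the same variant order.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same variants")
    return sum(d * w for d, w in zip(dosages, weights))


def standardize(scores):
    """Rescale scores to mean 0 and sample standard deviation 1."""
    mu = mean(scores)
    sd = stdev(scores)
    return [(s - mu) / sd for s in scores]


# Hypothetical toy data: 4 individuals, 3 variants (values are invented).
weights = [0.10, -0.05, 0.20]
cohort_dosages = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 0, 0],
]

raw_scores = [polygenic_score(d, weights) for d in cohort_dosages]
z_scores = standardize(raw_scores)

for raw, z in zip(raw_scores, z_scores):
    print(f"raw={raw:.2f}  standardized={z:.2f}")
```

Standardizing within the analysis sample is one common way to make the score interpretable per standard deviation when it is used as a covariate. Real data would additionally require matching the effect allele in the score file to the allele counted in the genotype data, which this sketch does not attempt.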

