Abstract
Genome-wide association studies have successfully identified common variants that are associated with complex diseases. However, the majority of genetic variants contributing to disease susceptibility are yet to be discovered. It is now widely believed that multiple rare variants are likely to be associated with complex diseases. Using custom-made chips or next-generation sequencing to uncover the effects of rare variants on disease can be very expensive with current technology. Consequently, many researchers use genotype imputation to predict the genotypes at rare variants that are not directly genotyped in the study sample. One important question in genotype imputation is how to choose a reference panel that will produce high imputation accuracy in a population of interest. Using whole genome sequence data from the Genetic Analysis Workshop 18 data set, this report compares genotype imputation accuracy among reference panels representing different degrees of genetic similarity to a study sample of admixed Mexican Americans. Results show that a reference panel that closely matches the ancestry of the study population can increase imputation accuracy, but it can also result in more missing genotype calls. A larger reference panel can reduce imputation error and missing genotypes, but the improvement may be limited. We also find that, for the admixed study sample, simply selecting a single best reference panel among the HapMap African, European, or Asian populations is not appropriate. The composite reference panel combining all available reference data should be used.
Highlights
Large-scale genome-wide association studies (GWAS) based on genotyping of common variants have identified only a small fraction of the heritable variation of complex diseases
Discordance and missing rates are calculated based on the 773,165 single-nucleotide polymorphisms (SNPs) that are present in both 1000 Genomes phase 1 and whole genome sequence (WGS) data, but not present in the GWAS data
The Genetic Analysis Workshop 18 (GAW18) WGS reference can have higher missing genotype rates than the 1000 Genomes references at most thresholds. These results may indicate that a reference panel that closely matches the ancestry of the study population can increase imputation accuracy, but that this risks losing diversity and makes it harder to identify haplotype sharing with simple models, thereby resulting in more missing genotype calls
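The discordance and missing rates described in the highlights can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the source: that each imputed genotype comes with three posterior probabilities (for 0, 1, or 2 copies of an allele), that a genotype is called only when the largest probability meets a threshold, and that discordance is measured among called genotypes against the sequenced genotype. All function names and the toy data are hypothetical.

```python
def evaluation_snps(reference_snps, wgs_snps, gwas_snps):
    """SNPs present in both the reference panel and the WGS data but absent from the GWAS data."""
    return (set(reference_snps) & set(wgs_snps)) - set(gwas_snps)


def call_genotype(probs, threshold):
    """Return the most likely genotype (0, 1, or 2), or None if it falls below the call threshold."""
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def discordance_and_missing(posteriors, truth, threshold):
    """Return (discordance rate among called genotypes, missing rate among all genotypes)."""
    called = discordant = missing = 0
    for probs, true_gt in zip(posteriors, truth):
        gt = call_genotype(probs, threshold)
        if gt is None:
            missing += 1
        else:
            called += 1
            discordant += gt != true_gt
    total = called + missing
    disc_rate = discordant / called if called else float("nan")
    miss_rate = missing / total if total else float("nan")
    return disc_rate, miss_rate


# Toy data, made up for illustration only (assumed posterior-probability format).
posteriors = [(0.95, 0.04, 0.01), (0.10, 0.85, 0.05), (0.40, 0.35, 0.25), (0.02, 0.08, 0.90)]
truth = [0, 2, 0, 2]
for t in (0.5, 0.9):
    print(t, discordance_and_missing(posteriors, truth, t))
```

In this toy example, raising the threshold from 0.5 to 0.9 lowers the discordance rate from 1/3 to 0 while raising the missing rate from 0.25 to 0.5, which is the kind of trade-off between accuracy and missing calls that the highlights describe.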
Summary
Large-scale genome-wide association studies (GWAS) based on genotyping of common variants (minor allele frequency [MAF] ≥ 5%) have identified only a small fraction of the heritable variation of complex diseases. Many researchers use genotype imputation to predict the genotypes at rare variants that are not directly genotyped in the study sample [3]. These predicted genotypes can be [...] Imputation methods work by combining a reference panel of individuals genotyped at a dense set of single-nucleotide polymorphisms (SNPs) with a study sample genotyped at a subset of these sites [4]. One might include only the individuals who most closely match the ancestry of the study population as the reference panel [7]. This “best match” strategy reduces the computational burden of imputation, but it can yield suboptimal accuracy when only partial information from diverse reference collections is used, or in studies with no clear reference matches (e.g., admixed populations) [6]. Several studies [6,8] have compared and discussed various choices of reference panels.
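The MAF cutoff mentioned above can be made concrete with a short sketch. It assumes genotypes are coded as counts of the alternate allele (0, 1, or 2) with `None` for a missing call, and it treats anything below the 5% cutoff as "rare", which is a simplification for illustration. The function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as alternate-allele counts (0, 1, 2); None marks a missing call."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("no observed genotypes")
    # Each diploid individual carries two alleles.
    alt_freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1 - alt_freq)


def is_common(genotypes, cutoff=0.05):
    """True if the variant meets the common-variant cutoff (MAF >= 5% by default)."""
    return minor_allele_frequency(genotypes) >= cutoff


# Toy genotype vectors, made up for illustration only.
common_variant = [0] * 8 + [1, 1]       # 2 alternate alleles out of 20
rare_variant = [0] * 49 + [1]           # 1 alternate allele out of 100
print(is_common(common_variant), is_common(rare_variant))
```

Taking the minimum of the two allele frequencies means the result does not depend on which allele happens to be labeled as the alternate.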