Validation of genotype imputation in Southeast Asian populations and the effect of single nucleotide polymorphism annotation on imputation outcome

Worachart Lert-Itthiporn, Prapat Suriyaphol, Harald Grove, Fumihiko Matsuda, Anavaj Sakuntabhai, Bhoom Suktitipat, Nattaya Tangthawornchaikul, Prida Malasit

doi:10.1186/s12881-018-0534-8

Abstract

Background: Imputation involves the inference of untyped single nucleotide polymorphisms (SNPs) in genome-wide association studies. The haplotypic reference of choice for imputation in Southeast Asian populations is unclear. Moreover, the influence of SNP annotation on imputation results has not been examined.

Methods: This study was divided into two parts. In the first part, we applied imputation to genotyped SNPs from Southeast Asian populations in the Pan-Asian SNP database. Five percent of the total SNPs were removed, the remaining SNPs were used as input for imputation with IMPUTE2, and the imputed genotypes were verified against the removed SNPs. We compared two imputation references: the Chinese and Japanese haplotypes from the HapMap phase II (HMII) and the complete set of haplotypes from the 1000 Genomes Project (1000G). In the second part, we assessed imputation accuracy and yield in a Thai patient dataset. Half of the autosomal SNPs were removed to create Set 1. A second dataset, Set 2, was then created by removing the other half of the SNPs instead. Both Set 1 and Set 2 were imputed with HMII to create a complete imputed SNP dataset, which was used to evaluate association testing, SNP annotation and imputation outcome.

Results: The accuracy was highest for all populations when using the HMII reference, but at the cost of a lower yield. Thai genotypes showed the highest accuracy among the populations with both the HMII and 1000G panels, although accuracy and yield varied across chromosomes. Imputation was tested in a clinical dataset to compare accuracy in gene-related regions, and coding regions were found to have a higher accuracy and yield.

Conclusions: This work provides the first evidence to guide imputation reference selection for Southeast Asian studies and highlights the effect of SNP location relative to genes on imputation outcome. Researchers will need to consider the trade-off between accuracy and yield in future imputation studies.

Highlights

Imputation involves the inference of untyped single nucleotide polymorphisms (SNPs) in genome-wide association studies
Imputation with HapMap phase II (HMII) as a reference gave an average accuracy of 96.57%, while for 1000 Genomes Project (1000G), the accuracy was 93.98% (Fig. 1a)
The yield for each population was lower when imputation was performed with the HMII reference compared to the 1000G reference (Fig. 1b)

Summary

Introduction

Imputation involves the inference of untyped single nucleotide polymorphisms (SNPs) in genome-wide association studies. In this process, samples are genotyped using a low-density SNP array and the untyped SNPs are imputed with information from a reference panel genotyped on a high-density SNP array. The method can also recover genotypes that are missing because of technical issues. The choice of reference panel affects the outcome: one study of malaria resistance in Gambian children only identified a previously known hemoglobin S variant in the hemoglobin-β gene when a Gambian-specific reference was used [1]. Although this problem is more likely to occur in Africa, where linkage disequilibrium (LD) is considerably lower than in Europe and Asia [4], determining how to choose the best reference is relevant for any study performing imputation with publicly available reference sets.

Methods
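
The masking-based validation described in the abstract (hide a fraction of genotyped SNPs, impute them, then compare against the hidden truth) can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this text: accuracy is taken as the concordance between called imputed genotypes and the true genotypes, yield as the fraction of masked genotypes that receive a call, and a call is made only when the best posterior probability reaches a threshold of 0.9. The function names and the threshold are hypothetical and are not taken from the study or from IMPUTE2.

```python
import random


def mask_snps(snp_ids, fraction=0.05, seed=0):
    """Randomly choose a fraction of SNP ids to hide before imputation."""
    rng = random.Random(seed)
    n_masked = round(len(snp_ids) * fraction)
    return set(rng.sample(sorted(snp_ids), n_masked))


def call_genotype(probs, threshold=0.9):
    """Return the most probable genotype (0, 1 or 2 copies of an allele),
    or None when its posterior probability is below the threshold
    (the 0.9 default is an assumed value)."""
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def accuracy_and_yield(true_genotypes, imputed_probs, threshold=0.9):
    """Compare imputed genotypes with the masked truth.

    true_genotypes: {(sample, snp): genotype}
    imputed_probs:  {(sample, snp): (p0, p1, p2)}
    Returns (accuracy, yield) under the assumed definitions:
    accuracy = concordant calls / calls, yield = calls / masked genotypes.
    """
    total = called = concordant = 0
    for key, truth in true_genotypes.items():
        total += 1
        call = call_genotype(imputed_probs[key], threshold)
        if call is None:
            continue  # no confident call: lowers yield, not accuracy
        called += 1
        if call == truth:
            concordant += 1
    accuracy = concordant / called if called else 0.0
    yield_ = called / total if total else 0.0
    return accuracy, yield_
```

Under these definitions, raising the threshold tends to discard uncertain calls, which can raise accuracy while lowering yield. This is one way to read the accuracy and yield trade-off that the abstract reports.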

Results

Discussion

Conclusion