Abstract

Background: Genotype imputation is a common technique in genetic research. Genetic similarity between the target population and the reference dataset is crucial for high-quality results. Although several reference panels are available, it is often unclear which is best suited to a particular target dataset. Maximizing genetic similarity between the study sample and the candidate reference panels may be the straightforward method for selecting the genetically best-matched reference. However, the impact of genetic similarity on imputation accuracy has not yet been studied in detail.

Results: We performed a simulation study in 20 ethnic groups obtained from POPRES. High-quality SNPs were masked and re-imputed with MaCH, MaCH-minimac and IMPUTE2 using four different HapMap reference panels (CEU, CHB-JPT, MEX and YRI). Imputation accuracy was assessed by several statistics. Genetic similarity between ethnic groups and reference populations was measured by F-statistics (FST), originally proposed by Wright, and G-statistics (GST), introduced by Nei and others. To assess how well these measures predict imputation accuracy, we analysed the relationship between them and the corresponding imputation accuracy scores. We found that population genetic distances between homogeneous reference and target populations were strongly linearly correlated with the resulting imputation accuracies, irrespective of the distance measure, imputation accuracy measure, missingness and imputation software considered. A possible exception was the African population.

Conclusion: Using GST or FST-related measures to predict the optimal reference panel is highly recommended for imputation frameworks that rely on a specific reference. A cut-off of GST < 0.01 is recommended to achieve good imputation results for high-frequency variants and small data sets. The linear relationship is less pronounced for low-frequency variants, for which we also observed a dependence of imputation accuracy on the number of polymorphic sites in the reference. We also show that the software-specific measures MaCH-Rsq and IMPUTE-info must be interpreted with caution if the genetic distance between target and reference population is high.

Electronic supplementary material: The online version of this article (doi:10.1186/s12863-015-0248-2) contains supplementary material, which is available to authorized users.

Highlights

  • Genotype imputation is a common technique applied in the context of genome-wide association (GWA) analysis

  • We investigated the cause of this deviation and found that low-frequency variants (SNPs with minor allele frequency (MAF) ≤ 0.05) strongly influence FRST, while G-statistics (GST) are robust

  • We conclude that GST is a good predictor of imputation accuracy for all types of imputation frameworks used under the best-matching policy for selecting a reference panel
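
The best-matching policy can be illustrated with a minimal sketch. It assumes Nei's definition GST = (HT - HS) / HT for biallelic SNPs, with both heterozygosities averaged over SNPs, equal weights for the two populations and no sample-size correction; the exact estimator used in the study is not stated here and may differ. The function names `nei_gst` and `best_matching_panel` and the allele frequencies in the usage example are hypothetical and serve only as illustration.

```python
from typing import Dict, Sequence, Tuple


def nei_gst(target_freqs: Sequence[float], ref_freqs: Sequence[float]) -> float:
    """Nei's G_ST between two populations from per-SNP allele frequencies.

    Assumption: G_ST = (H_T - H_S) / H_T, with H_T and H_S summed over SNPs
    (ratio of averages), equal population weights, no sample-size correction.
    """
    if len(target_freqs) != len(ref_freqs):
        raise ValueError("frequency vectors must cover the same SNPs")
    h_s_sum = 0.0  # within-population expected heterozygosity
    h_t_sum = 0.0  # expected heterozygosity of the pooled population
    for p1, p2 in zip(target_freqs, ref_freqs):
        h_s_sum += 0.5 * (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2))
        p_bar = 0.5 * (p1 + p2)
        h_t_sum += 2 * p_bar * (1 - p_bar)
    if h_t_sum == 0.0:
        raise ValueError("no polymorphic SNPs in the pooled sample")
    return (h_t_sum - h_s_sum) / h_t_sum


def best_matching_panel(
    target_freqs: Sequence[float],
    panels: Dict[str, Sequence[float]],
    cutoff: float = 0.01,  # the G_ST < 0.01 cut-off recommended in the abstract
) -> Tuple[str, float, bool]:
    """Return (panel name, its G_ST, whether that G_ST is below the cut-off)."""
    scores = {name: nei_gst(target_freqs, freqs) for name, freqs in panels.items()}
    best = min(scores, key=scores.get)  # smallest genetic distance wins
    return best, scores[best], scores[best] < cutoff


# Usage with made-up allele frequencies for three SNPs (not study data).
target = [0.10, 0.50, 0.30]
panels = {"panel_A": [0.10, 0.50, 0.30], "panel_B": [0.90, 0.50, 0.70]}
name, gst, below_cutoff = best_matching_panel(target, panels)
print(name, round(gst, 4), below_cutoff)
```

In practice the frequencies would be estimated on the SNPs shared by the target data and each candidate panel, and the panel with the smallest GST would be chosen as the reference.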


Summary

Introduction

Genotype imputation is a common technique applied in the context of genome-wide association (GWA) analysis. A set of densely genotyped samples is used as a reference to infer a large set of untyped or missing markers in the target population. Strategies for selecting the individuals to be sequenced have been suggested recently [5]. These strategies consider genetic similarities between the study population, the subsets to be sequenced and the reference panel. Genetic similarity between the target population and the reference dataset is crucial for high-quality results. Maximizing genetic similarity between the study sample and the candidate reference panels may be the straightforward method for selecting the genetically best-matched reference. However, the impact of genetic similarity on imputation accuracy has not yet been studied in detail.

