Imputation of missing genotypes: an empirical evaluation of IMPUTE

Zhenming Zhao, David Hk Chui, Clinton T Baldwin, Stephen W Hartley, Thomas T Perls, Martin H Steinberg, Nadia Timofeev, Supan Fucharoen, Paola Sebastiani

doi:10.1186/1471-2156-9-85

Open Access

Abstract

Background: Imputation of missing genotypes is becoming a very popular solution for synchronizing genotype data collected with different microarray platforms, but the effects of ethnic background, subject ascertainment, and amount of missing data on the accuracy of imputation are not well understood.

Results: We evaluated the accuracy of the program IMPUTE in generating the genotype data of partially or fully untyped single nucleotide polymorphisms (SNPs). The program uses a model-based approach that reconstructs the genotype distribution given a set of reference haplotypes and the observed data. It then uses this distribution to compute, for each individual subject, the marginal probability of each missing genotype, and these probabilities are used to impute the missing data. We assembled genome-wide data from five different studies and three different ethnic groups: Caucasians, African Americans and Asians. We randomly removed genotype data and then compared the observed genotypes with those generated by IMPUTE. Our analysis shows 97% median accuracy in Caucasian subjects when less than 10% of the SNPs are untyped and imputed genotypes are accepted regardless of their posterior probability. The median accuracy increases to 99% when we require a minimum posterior probability of 0.95 for an imputed genotype to be accepted. The accuracy decreases to 86% in African Americans and to 94% in Asians. We propose a strategy to improve the accuracy by leveraging the level of admixture in African Americans.

Conclusion: Our analysis suggests that IMPUTE is very accurate in samples of Caucasian origin, slightly less accurate in samples of Asian background, and substantially less accurate in samples of admixed background such as African Americans. Sample size and ascertainment do not seem to affect the accuracy of imputation.
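The scoring step described in the abstract (comparing masked, observed genotypes with imputed calls, with or without a minimum posterior probability) can be illustrated with a minimal Python sketch. It assumes that each masked genotype comes with three posterior probabilities, one per genotype coded as 0, 1 or 2 copies of an allele. The function names, the data layout and the toy numbers are hypothetical; they are not IMPUTE's file format or the authors' code.

```python
from typing import Optional, Sequence, Tuple


def call_genotype(posterior: Sequence[float], threshold: float = 0.0) -> Optional[int]:
    """Return the most probable genotype (0, 1 or 2), or None when its
    posterior probability is below the threshold (assumed inclusive)."""
    best = max(range(len(posterior)), key=lambda g: posterior[g])
    return best if posterior[best] >= threshold else None


def imputation_accuracy(
    true_genotypes: Sequence[int],
    posteriors: Sequence[Sequence[float]],
    threshold: float = 0.0,
) -> Tuple[Optional[float], float]:
    """Return (accuracy among accepted calls, fraction of calls accepted)."""
    if not true_genotypes or len(true_genotypes) != len(posteriors):
        raise ValueError("need equally sized, non-empty inputs")
    calls = [call_genotype(p, threshold) for p in posteriors]
    # Keep only the genotypes whose best call passed the threshold.
    accepted = [(c, t) for c, t in zip(calls, true_genotypes) if c is not None]
    call_rate = len(accepted) / len(calls)
    if not accepted:
        return None, call_rate
    accuracy = sum(c == t for c, t in accepted) / len(accepted)
    return accuracy, call_rate


# Toy data (made up for illustration): four masked genotypes of one SNP.
truth = [0, 1, 2, 1]
posteriors = [
    (0.98, 0.02, 0.00),
    (0.10, 0.85, 0.05),
    (0.01, 0.03, 0.96),
    (0.60, 0.39, 0.01),  # confidently wrong only at a low threshold
]

print(imputation_accuracy(truth, posteriors))        # (0.75, 1.0)
print(imputation_accuracy(truth, posteriors, 0.95))  # (1.0, 0.5)
```

In this toy example the threshold raises accuracy among accepted calls at the cost of leaving half of the genotypes uncalled, which is why the sketch reports the accepted fraction alongside the accuracy.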

Highlights

Imputation of missing genotypes is becoming a very popular solution for synchronizing genotype data collected with different microarray platforms, but the effects of ethnic background, subject ascertainment, and amount of missing data on the accuracy of imputation are not well understood
The median accuracy increases to 99% when we impose a posterior probability greater than 0.95 as the threshold for accepting the imputed genotypes
The data are essentially those summarized in column 2 of Table 1 and show clearly skewed results: a very small number of single nucleotide polymorphisms (SNPs) failed to be imputed correctly, while the majority of SNPs were imputed with high accuracy

Summary

Introduction

Imputation of missing genotypes is becoming a very popular solution for synchronizing genotype data collected with different microarray platforms, but the effects of ethnic background, subject ascertainment, and amount of missing data on the accuracy of imputation are not well understood. The most popular solution is implemented in fastPHASE [3], which uses a hidden Markov model to describe the spatial distribution of clusters of haplotypes along a chromosome and reconstructs individual haplotypes from unphased genotype data using a Bayesian rule. The great potential of this method is that it allows investigators to synchronize genotype data that were typed using different platforms, and several authors have shown that this approach increases the power of genome-wide association studies [9].

Methods

Results

Conclusion