A new genotype imputation method with tolerance to high missing rate and rare variants.

Yumei Yang, Yuchun Pan, Zhiwu Zhang, Qishan Wang, Rongrong Liao, Youmin Zheng, Hongjie Yang, Qiang Chen, Xiangzhe Zhang, Xiaodong Cai

doi:10.1371/journal.pone.0101025

Abstract

We report a novel algorithm, iBLUP, that imputes missing genotypes by simultaneously and comprehensively using identity by descent and linkage disequilibrium information. Simulation studies showed that the algorithm was drastically more tolerant of high missing rates, especially for rare variants, than other common imputation methods such as BEAGLE and fastPHASE. At a missing rate of 70%, the accuracy of BEAGLE and fastPHASE dropped to 0.82 and 0.74, respectively, while iBLUP retained an accuracy of 0.95. For the minor allele, the accuracy of BEAGLE and fastPHASE decreased to −0.1 and 0.03, respectively, while iBLUP still had an accuracy of 0.61. We implemented the algorithm in a publicly available software package, also named iBLUP. We also demonstrated the application of iBLUP to real sequencing data from an outbred pig population.
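The comparison at a 70% missing rate can be pictured with a small masking experiment. The sketch below is only an illustration under stated assumptions: genotypes are coded 0/1/2, accuracy is taken to be the Pearson correlation between true and imputed values at the masked positions (the abstract does not define its accuracy measure), the toy data are simulated, and the column-mean imputer is a hypothetical placeholder standing in for a real method. It is not iBLUP, BEAGLE, or fastPHASE.

```python
import numpy as np

def mask_genotypes(genotypes, missing_rate, rng):
    """Return a float copy with a random fraction of entries set to NaN, plus the mask."""
    mask = rng.random(genotypes.shape) < missing_rate
    masked = genotypes.astype(float)
    masked[mask] = np.nan
    return masked, mask

def impute_column_mean(masked):
    """Placeholder imputer (NOT iBLUP): fill each marker's gaps with its observed mean."""
    col_means = np.nanmean(masked, axis=0)
    filled = masked.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

def imputation_accuracy(truth, imputed, mask):
    """Assumed accuracy measure: Pearson correlation at the masked positions."""
    return float(np.corrcoef(truth[mask], imputed[mask])[0, 1])

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 individuals x 50 markers, genotypes coded 0/1/2.
allele_freq = rng.uniform(0.05, 0.5, size=50)
genotypes = rng.binomial(2, allele_freq, size=(200, 50))

masked, mask = mask_genotypes(genotypes, missing_rate=0.7, rng=rng)
imputed = impute_column_mean(masked)
accuracy = imputation_accuracy(genotypes, imputed, mask)
print(f"masked fraction: {mask.mean():.2f}, accuracy: {accuracy:.2f}")
```

Swapping `impute_column_mean` for a call to an actual imputation tool, and restricting `mask` to minor-allele genotypes, would give the kind of per-method and rare-variant comparison the abstract summarizes.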

Highlights

Benefiting from advances in sequencing technologies, Genome-Wide Association Studies (GWAS) have revealed a substantial number of genetic loci controlling human diseases and agriculturally important traits [1,2,3].
To take full advantage of a multivariate mixed model (M-MM) that incorporates linkage disequilibrium (LD) and identity by descent (IBD) simultaneously, we made two major changes that enhance the representation of marker IBD information in the relationship matrix (K) among individuals and of marker LD information in the covariance matrix (G) of the underlying variables (see Figure 1).
Missing genotype imputation is a critical step between sequencing and the use of the data for GWAS and genomic prediction [29,30,31].
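To make the roles of the two matrices concrete, the sketch below builds an individual-by-individual similarity matrix and a marker-by-marker covariance matrix from a complete toy genotype matrix. This is a generic illustration, not the construction used in iBLUP: the function names, the simple mean-centering, and the scaling are all assumptions chosen for brevity, and the two changes the authors made to K and G are not reproduced here.

```python
import numpy as np

def center_markers(genotypes):
    """Subtract each marker's mean so that columns have zero mean."""
    return genotypes - genotypes.mean(axis=0)

def relationship_matrix(genotypes):
    """Generic stand-in for K: similarity among individuals (rows)."""
    z = center_markers(genotypes)
    return z @ z.T / genotypes.shape[1]

def marker_covariance(genotypes):
    """Generic stand-in for G: sample covariance among markers (columns)."""
    z = center_markers(genotypes)
    return z.T @ z / (genotypes.shape[0] - 1)

rng = np.random.default_rng(1)
# Hypothetical toy data: 6 individuals x 4 markers, genotypes coded 0/1/2.
genotypes = rng.binomial(2, 0.3, size=(6, 4)).astype(float)

K = relationship_matrix(genotypes)  # shape (6, 6)
G = marker_covariance(genotypes)    # shape (4, 4)
print(K.shape, G.shape)
```

In this picture, K captures how similar two individuals are across all markers, while G captures how strongly two markers co-vary across individuals, which is the kind of information a mixed model can draw on when predicting a missing genotype.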

Summary

Introduction

Benefiting from advances in sequencing technologies, Genome-Wide Association Studies (GWAS) have revealed a substantial number of genetic loci controlling human diseases and agriculturally important traits [1,2,3]. However, the identified loci collectively explain only a small proportion of the total variation [4,5,6,7]. Multiplexing is one of the advances that revolutionized high-throughput Genotyping By Sequencing (GBS). Samples are individually tagged and pooled into a single lane of a flow cell. This exponentially increases the number of samples analyzed in a single run without dramatically increasing cost or time [9].

Objectives

Methods

Results

Conclusion