Highly Accurate and Efficient Data-Driven Methods for Genotype Imputation.

Olivia Choudhury,Scott J Emrich,Ankush Chakrabarty

doi:10.1109/tcbb.2017.2708701

Abstract

High-throughput sequencing techniques have generated massive quantities of genotype data. Haplotype phasing has proven to be a useful and effective method for analyzing these data. However, the quality of phasing is undermined due to missing information. Imputation provides an effective means of improving the underlying genotype information. For model organisms, imputation can rely on an available reference genotype panel and a physical or genetic map. For non-model organisms, which often do not have a genotype panel, it is important to design an imputation technique that does not rely on reference data. Here, we present Accurate Data-Driven Imputation Technique (ADDIT), which is composed of two data-driven algorithms capable of handling data generated from model and non-model organisms. The non-model variant of ADDIT (referred to as ADDIT-NM) employs statistical inference methods to impute missing genotypes, whereas the model variant (referred to as ADDIT-M) leverages a supervised learning-based approach for imputation. We demonstrate that both variants of ADDIT are more accurate, faster, and require less memory than leading state-of-the-art imputation tools using model (human) and non-model (maize, apple, and grape) genotype data. Software Availability: The source code of ADDIT and test data sets are available at https://github.com/NDBL/ADDIT.

Full Text