Abstract

Genotype imputation estimates the genotypes of unobserved variants from the genotypes of observed variants, using a collection of haplotypes from thousands of individuals known as a haplotype reference panel. In general, larger haplotype reference panels yield more accurate imputation results. Most existing genotype imputation methods explicitly require the haplotype reference panel in its precise form, but access to haplotype data is often limited because it requires the agreement of the donors. Since de-identified information such as summary statistics or model parameters can be shared publicly, imputation methods that use de-identified haplotype reference information could improve imputation quality when access to the haplotype data is limited. In this study, we propose a novel imputation method that handles the reference panel as the model parameters of a bidirectional recurrent neural network (RNN). The model parameters are de-identified information from which restoring genotype data at the individual level is almost impossible. Using haplotype datasets from the 1000 Genomes Project (1KGP) and the Haplotype Reference Consortium, we demonstrate that the proposed method achieves imputation accuracy comparable to that of existing imputation methods. We also consider a scenario in which a subset of the haplotypes in the reference panel is available only in de-identified form. In the evaluation using the 1KGP dataset under this scenario, the imputation accuracy of the proposed method is much higher than that of the existing imputation methods. We therefore conclude that our RNN-based method is promising for further promoting the sharing of sensitive genome data amid the recent movement toward protecting individuals’ privacy.

Highlights

  • The development of high-throughput sequencing technologies enabled the construction of genotype data with base-level resolution for more than one thousand individuals

  • Although genotype data obtained using SNP arrays is limited to the designed markers, genotype data with sequencing-level resolution obtained from genotype imputation enables the detection of more trait-related variants in genome-wide association studies (GWAS) and more accurate estimation of trait heritability and polygenic risk scores [1,2,3]

  • We proposed a hybrid model obtained by combining two bidirectional recurrent neural network (RNN) models trained for different minor allele frequency (MAF) ranges as well as a new data augmentation process for more robust and accurate estimation

Summary

Introduction

The development of high-throughput sequencing technologies enabled the construction of genotype data with base-level resolution for more than one thousand individuals. Although genotype data obtained using SNP arrays is limited to the designed markers, genotype data with sequencing-level resolution obtained from genotype imputation enables the detection of more trait-related variants in GWAS and more accurate estimation of trait heritability and polygenic risk scores [1,2,3]. Imputation methods based on the Li and Stephens model take phased genotypes obtained from SNP arrays or other genotyping technologies as input, and estimate the haplotypes that match the input genotype data by considering recombinations of the haplotypes present in the haplotype reference panel. Genotypes of unobserved variants are then obtained from the estimated haplotypes.
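To make the Li and Stephens idea concrete, the following is a minimal sketch of haploid imputation with a copying hidden Markov model, where the hidden state at each site is the reference haplotype being copied. It is an illustration only and is not the RNN-based method proposed in this study, nor the implementation of any existing imputation tool. The function name `impute_dosages`, the parameters `switch` (per-site recombination probability) and `error` (mismatch probability), and the toy panel are all assumptions introduced for this example.

```python
def impute_dosages(panel, target, switch=0.05, error=0.01):
    """Toy Li-and-Stephens-style imputation for one phased haplotype.

    panel  : list of reference haplotypes, each a list of 0/1 alleles
    target : list of 0/1 alleles, with None at unobserved sites
    switch : assumed probability of switching the copied haplotype per site
    error  : assumed probability that an observed allele mismatches its template
    Returns the posterior expected alternate-allele dosage at every site.
    """
    n_hap, n_site = len(panel), len(target)

    def emission(k, m):
        # Unobserved sites carry no information about the copied haplotype.
        if target[m] is None:
            return 1.0
        return 1.0 - error if panel[k][m] == target[m] else error

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: each column is normalized, so the "switch" term
    # reduces to switch / n_hap.
    fwd = [normalize([emission(k, 0) / n_hap for k in range(n_hap)])]
    for m in range(1, n_site):
        prev = fwd[-1]
        fwd.append(normalize([
            ((1.0 - switch) * prev[k] + switch / n_hap) * emission(k, m)
            for k in range(n_hap)
        ]))

    # Backward pass, also normalized per site for numerical stability.
    bwd = [[1.0] * n_hap for _ in range(n_site)]
    for m in range(n_site - 2, -1, -1):
        weighted = [bwd[m + 1][k] * emission(k, m + 1) for k in range(n_hap)]
        total = sum(weighted)
        bwd[m] = normalize([
            (1.0 - switch) * weighted[k] + switch * total / n_hap
            for k in range(n_hap)
        ])

    # Posterior over copied haplotypes gives the expected allele per site.
    dosages = []
    for m in range(n_site):
        post = normalize([fwd[m][k] * bwd[m][k] for k in range(n_hap)])
        dosages.append(sum(post[k] * panel[k][m] for k in range(n_hap)))
    return dosages


# Hypothetical toy data: three reference haplotypes over five sites,
# with the target observed only at sites 0, 2 and 4.
panel = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
]
target = [1, None, 1, None, 1]
dosages = impute_dosages(panel, target)
```

In this toy case only the second reference haplotype agrees with every observed allele, so the imputed dosages at the two unobserved sites are close to 1. Note that the sketch needs the reference haplotypes themselves at imputation time, which is the access requirement the proposed method aims to avoid.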
