Scanning and Filling: Ultra-Dense SNP Genotyping Combining Genotyping-By-Sequencing, SNP Array and Whole-Genome Resequencing Data.

Davoud Torkamaneh, Francois Belzile

doi:10.1371/journal.pone.0131533

Abstract

Genotyping-by-sequencing (GBS) represents a highly cost-effective high-throughput genotyping approach. By nature, however, GBS is subject to generating sizeable amounts of missing data and these will need to be imputed for many downstream analyses. The extent to which such missing data can be tolerated in calling SNPs has not been explored widely. In this work, we first explore the use of imputation to fill in missing genotypes in GBS datasets. Importantly, we use whole genome resequencing data to assess the accuracy of the imputed data. Using a panel of 301 soybean accessions, we show that over 62,000 SNPs could be called when tolerating up to 80% missing data, a five-fold increase over the number called when tolerating up to 20% missing data. At all levels of missing data examined (between 20% and 80%), the resulting SNP datasets were of uniformly high accuracy (96–98%). We then used imputation to combine complementary SNP datasets derived from GBS and a SNP array (SoySNP50K). We thus produced an enhanced dataset of >100,000 SNPs and the genotypes at the previously untyped loci were again imputed with a high level of accuracy (95%). Of the >4,000,000 SNPs identified through resequencing 23 accessions (among the 301 used in the GBS analysis), 1.4 million tag SNPs were used as a reference to impute this large set of SNPs on the entire panel of 301 accessions. These previously untyped loci could be imputed with around 90% accuracy. Finally, we used the 100K SNP dataset (GBS + SoySNP50K) to perform a GWAS on seed oil content within this collection of soybean accessions. Both the number of significant marker-trait associations and the peak significance levels were improved considerably using this enhanced catalog of SNPs relative to a smaller catalog resulting from GBS alone at ≤20% missing data. 
Our results demonstrate that imputation can be used to fill in both missing genotypes and untyped loci with very high accuracy and that this leads to more powerful genetic analyses.
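As an illustration of how imputed genotypes can be scored against resequencing data, the sketch below computes a simple concordance rate at the positions that were missing before imputation. The abstract does not specify the exact accuracy metric, so plain concordance is an assumption here, as are the function name `imputation_accuracy` and the 0/1/2 genotype coding (count of alternate alleles, `None` for a missing call).

```python
from typing import List, Optional

Genotype = Optional[int]  # 0, 1 or 2 alternate alleles; None = missing call


def imputation_accuracy(
    observed: List[Genotype],
    imputed: List[Genotype],
    truth: List[Genotype],
) -> Optional[float]:
    """Concordance between imputed and reference genotypes.

    Only positions that were missing in the observed data, and for which
    both an imputed call and a reference (e.g. resequencing) call exist,
    are compared. Returns None when no position can be compared.
    """
    compared = 0
    matches = 0
    for obs, imp, ref in zip(observed, imputed, truth):
        if obs is None and imp is not None and ref is not None:
            compared += 1
            if imp == ref:
                matches += 1
    return matches / compared if compared else None


# Hypothetical calls for one SNP across five accessions.
observed = [0, None, None, 2, None]   # GBS calls before imputation
imputed = [0, 0, 2, 2, 1]             # calls after imputation
truth = [0, 0, 2, 2, 2]               # calls from resequencing

accuracy = imputation_accuracy(observed, imputed, truth)
```

In this toy example three genotypes were filled in by imputation and two of them agree with the reference, so `accuracy` is 2/3. The values are invented for illustration and are not taken from the study.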

Highlights

Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways
Using a panel of 301 soybean accessions, we show that over 62,000 single nucleotide polymorphisms (SNPs) could be called when tolerating up to 80% missing data, a five-fold increase over the number called when tolerating up to 20% missing data
We first explored the impact of two key filtering steps central to the production of SNP catalogs derived from GBS analysis: the maximal amount of missing data allowed (MaxMD, in %) and the minimal minor allele frequency (MinMAF)
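
The two filters named in the last highlight can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the 0/1/2 genotype coding, the use of fractions rather than percentages, and the MinMAF value of 0.05 are all assumptions made for the example.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1 or 2 alternate alleles; None = missing call


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of samples without a genotype call at this SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency over called genotypes (diploid coding)."""
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_snps(
    snps: Dict[str, List[Genotype]], max_md: float, min_maf: float
) -> List[str]:
    """Return IDs of SNPs with missing rate <= max_md and MAF >= min_maf."""
    return [
        snp_id
        for snp_id, calls in snps.items()
        if missing_rate(calls) <= max_md
        and minor_allele_frequency(calls) >= min_maf
    ]


# Hypothetical genotype calls for three SNPs across six accessions.
snps = {
    "snp_a": [0, 0, 2, 2, 0, None],           # 1/6 missing, MAF 0.4
    "snp_b": [0, None, None, None, 2, None],  # 4/6 missing, MAF 0.5
    "snp_c": [0, 0, 0, 0, 0, 0],              # monomorphic, MAF 0.0
}

strict = filter_snps(snps, max_md=0.2, min_maf=0.05)
relaxed = filter_snps(snps, max_md=0.8, min_maf=0.05)
```

With these toy data, `strict` keeps only `snp_a`, while `relaxed` also keeps `snp_b`. This mirrors, on a very small scale, how raising MaxMD enlarges the SNP catalog while the MinMAF filter continues to exclude monomorphic sites.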

Summary

Introduction

Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways. In particular, it has greatly facilitated the development of methods to genotype very large numbers of molecular markers such as single nucleotide polymorphisms (SNPs). In one such approach, large-scale sequencing has allowed researchers to probe nucleotide diversity in panels of individuals to discover polymorphic sites and to develop genotyping arrays (“SNP chips”) that can subsequently be used to determine the genotype of an individual line at thousands to millions of such SNPs [4,5]. An example of this approach is the SoySNP50K array, which was constructed to interrogate over 52K SNPs, of which 47,337 were found to be polymorphic among a set of 288 elite cultivars, landraces and wild soybean accessions [6]. RAD-Seq (Restriction site Associated DNA Sequencing) and genotyping-by-sequencing (GBS) are two examples of such SNP genotyping approaches relying on NGS [7,8].

Methods

Results

Conclusion