Increasing calling accuracy, coverage, and read-depth in sequence data by the use of haplotype blocks.

Torsten Pook, Daniel Valle Torres, Henner Simianer, Chris-Carolin Schoen, Eric Gerardo Gonzalez Segovia, Adnane Nemri, Jonathan Marchini

doi:10.1371/journal.pgen.1009944

Open Access

Abstract

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to balance data quality against the number of genotyped lines across the available genotyping technologies when resources are limited. In this work, we propose a new imputation pipeline ("HBimpute") that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is to use haplotype blocks from the software HaploBlocker to identify locally similar lines and subsequently use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputation error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par with or even slightly better than results obtained with high-density array data (600k). In particular for genomic prediction, we observe slightly higher data quality for the sequence data compared to the 600k array in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of overlapping markers between sequence and array, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.
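The pooling idea described above can be illustrated with a toy example. The following is a minimal sketch, not the HBimpute implementation: it assumes fully homozygous (doubled haploid) lines, a single biallelic marker, and a block assignment that is already known. The function name `pool_and_call`, the line labels, the block labels, and the simple majority-vote calling rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def pool_and_call(read_counts, block_of_line, min_depth=1):
    """Pool reads across lines sharing a haplotype block at one marker.

    read_counts:   {line: (ref_reads, alt_reads)} observed at the marker
    block_of_line: {line: block_id or None} local block membership (assumed given)
    Returns {line: (call, depth)} with call 0 (ref), 1 (alt), or None (no call).
    """
    # Sum the reference/alternative read counts of all lines in the same block.
    pooled = defaultdict(lambda: [0, 0])
    for line, (ref, alt) in read_counts.items():
        block = block_of_line.get(line)
        if block is not None:
            pooled[block][0] += ref
            pooled[block][1] += alt

    calls = {}
    for line, (ref, alt) in read_counts.items():
        block = block_of_line.get(line)
        if block is not None:
            # Use the pooled counts of the block instead of the line's own reads.
            ref, alt = pooled[block]
        depth = ref + alt
        if depth < min_depth or ref == alt:
            calls[line] = (None, depth)  # no reads or a tie: leave uncalled
        else:
            calls[line] = (0 if ref > alt else 1, depth)  # majority vote
    return calls


# Hypothetical low-coverage data: most lines have zero or one read at the marker.
read_counts = {"DH1": (1, 0), "DH2": (0, 0), "DH3": (2, 0), "DH4": (0, 1), "DH5": (0, 0)}
block_of_line = {"DH1": "A", "DH2": "A", "DH3": "A", "DH4": "B", "DH5": None}

calls = pool_and_call(read_counts, block_of_line)
for line, (call, depth) in sorted(calls.items()):
    print(line, call, depth)
```

In this toy data, line `DH2` has no reads of its own but still receives a call from the reads of the other lines in its block, and its effective read-depth becomes the pooled depth of the block. Line `DH5`, which is assigned to no block, stays uncalled and would have to be handled by a conventional imputation step.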

Highlights

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics and breeding
Genotyping arrays are still considered the gold standard in high-throughput quantitative genetics, but recent advances in sequencing provide new opportunities
Both the quality and cost of genomic data generated by sequencing are highly dependent on the read-depth used

Summary

Introduction

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics and breeding. In most common crop and livestock species, it is commonly performed using single nucleotide polymorphism (SNP) arrays. Genotyping arrays come in various marker densities, ranging from 10k SNPs [2] to 50k [3, 4] to 600k SNPs [3, 5, 6], are relatively straightforward to use [7], and typically produce robust genomic data with relatively few missing calls or calling errors [6]. Array markers are typically SNPs selected to be in relatively conserved regions of the genome [14, 15]; by design, they therefore provide little information on structural variants, although calling of structural variation is, in principle, possible via genotyping arrays [16].

Objectives

Methods

Results

Discussion

Conclusion