Abstract

Background: Inherent sources of error and bias that affect the quality of sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data because low-coverage data has scant information and many standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing, there is a need to understand the impact of these errors and bias on resulting genotype calls from low-coverage sequencing.

Results: We used a dataset of 26 pigs sequenced both at 2× with multiplexing and at 30× without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by a number of reads below a predefined threshold, which is a default and desired step of some variant callers for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage sequence data. This bias reduced best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points.

Conclusions: We propose a simple pipeline to correct the preferential bias towards the reference allele that can occur during variant discovery and we recommend that users of low-coverage sequence data be wary of unexpected biases that may be produced by bioinformatic tools that were designed for high-coverage sequence data.
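To illustrate why a read-support threshold might matter far more at 2× than at 30×, the sketch below computes how often the alternative allele at a heterozygous site would fall below such a threshold. It is not the authors' pipeline or any variant caller's actual logic. It assumes Poisson-distributed read depth, equal sampling of both alleles, and a hypothetical threshold of two supporting reads; the function name and all parameter values are illustrative assumptions.

```python
from math import exp, factorial


def prob_alt_below_threshold(mean_depth, min_alt_reads, alt_fraction=0.5):
    """Probability that a heterozygous site has fewer than `min_alt_reads`
    reads carrying the alternative allele.

    Assumption (not from the source): total depth is Poisson with mean
    `mean_depth`, and each read carries the alternative allele with
    probability `alt_fraction`. Under that model the alternative-read
    count is itself Poisson with mean `mean_depth * alt_fraction`.
    """
    lam = mean_depth * alt_fraction
    # Cumulative Poisson probability of observing 0 .. min_alt_reads - 1 reads.
    return sum(exp(-lam) * lam**k / factorial(k) for k in range(min_alt_reads))


# Hypothetical threshold: the alternative haplotype needs at least 2 reads.
HYPOTHETICAL_MIN_ALT_READS = 2

for coverage in (2, 30):
    p = prob_alt_below_threshold(coverage, HYPOTHETICAL_MIN_ALT_READS)
    print(f"{coverage}x coverage: P(alt support below threshold) = {p:.6f}")
```

Under these assumptions, most heterozygous sites at 2× would carry fewer than two alternative reads, whereas at 30× this would be very rare, which is consistent with the direction of the bias described above.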

Highlights

  • Inherent sources of error and bias that affect the quality of sequence data include index hopping and bias towards the reference allele

  • We explored the impact of index hopping and bias towards the reference allele in low-coverage sequence data

  • We show that index hopping and bias towards the reference allele due to alignment have little impact on genotype calls

Introduction

Inherent sources of error and bias that affect the quality of sequence data include index hopping and bias towards the reference allele. Two of the most important causes of incorrect genotype calls are index hopping and preferential bias of some bioinformatic tools towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data because low-coverage data has scant information and many standard tools for processing sequence data were designed for high-coverage data.
