Abstract

Background: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.

Results: In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To enable this evaluation, we apply a UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated and real sequencing data, and use these data to perform a realistic evaluation of error-correction methods.

Conclusions: In terms of accuracy, we find that method performance varies substantially across different types of datasets, with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.

Highlights

  • Rapid advancements in next-generation sequencing have improved our ability to study the genomic material of a biological sample at an unprecedented scale and promise to revolutionize our understanding of living systems [1]

  • Bases in the corrected reads fall into the categories of trimming, true negative (TN), true positive (TP), false negative (FN), and false positive (FP) (illustrated in the sketch after this list)

  • We evaluate the ability of error correction algorithms to fix errors across different types of datasets with various levels of heterogeneity
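
The base-level classification mentioned above can be illustrated with a short sketch. The following Python snippet (a minimal illustration, not the benchmarking pipeline itself; all function and variable names are hypothetical) compares each aligned position of the original and corrected reads against the error-free read, tallies TP, TN, FP, and FN, and derives the commonly used precision, sensitivity, and gain metrics from those counts.

```python
from collections import Counter

def classify_bases(error_free, original, corrected):
    """Classify each aligned base of a corrected read.

    TP: an erroneous base in the original read that was fixed correctly.
    TN: a correct base in the original read that was left unchanged.
    FP: a correct base in the original read that was changed (new error).
    FN: an erroneous base in the original read that was not fixed.
    All three sequences are assumed to be pre-aligned to the same length;
    bases removed by the corrector are marked '-' and counted as trimmed.
    """
    counts = Counter()
    for truth, orig, corr in zip(error_free, original, corrected):
        if corr == "-":                  # base removed by the corrector
            counts["trimmed"] += 1
        elif orig == truth and corr == truth:
            counts["TN"] += 1
        elif orig == truth and corr != truth:
            counts["FP"] += 1
        elif orig != truth and corr == truth:
            counts["TP"] += 1
        else:                            # error present and still wrong
            counts["FN"] += 1
    return counts

def summary_metrics(c):
    """Precision, sensitivity, and gain computed from base-level counts."""
    tp, fp, fn = c["TP"], c["FP"], c["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    gain = (tp - fp) / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity, gain

# Example: the error at position 3 was fixed (TP), but a new error was
# introduced at position 6 (FP).
counts = classify_bases("ACGTACGT", "ACGAACGT", "ACGTACTT")
print(counts, summary_metrics(counts))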


Introduction

Rapid advancements in next-generation sequencing have improved our ability to study the genomic material of a biological sample at an unprecedented scale and promise to revolutionize our understanding of living systems [1].

(Figure caption, panel B: Error-free reads for the gold standard were generated using UMI-based clustering. Reads were grouped based on matching UMIs and corrected by consensus, where an 80% majority was required to correct sequencing errors without affecting naturally occurring single nucleotide variations (SNVs). Multiple sequence alignment between the error-free, uncorrected (original), and corrected reads was performed to classify bases in the corrected read.)
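
As a rough illustration of the consensus step described in the figure caption, the sketch below builds a consensus read from reads sharing the same UMI and applies a correction only when the majority base reaches the 80% threshold. This is a minimal sketch under stated assumptions (reads in a UMI group are already grouped and aligned to equal length; function names and the handling of ambiguous columns are hypothetical), not the protocol's actual implementation.

```python
from collections import Counter

def consensus_correct(umi_group, threshold=0.8):
    """Build a consensus read from reads that share the same UMI.

    A position is set to the majority base only when that base is supported
    by at least `threshold` of the reads; otherwise the column is left
    unresolved ('N' here, one possible handling), so that positions carrying
    real SNVs are not overwritten as if they were sequencing errors.
    Reads in `umi_group` are assumed to be pre-aligned to equal length.
    """
    consensus = []
    for column in zip(*umi_group):
        base, count = Counter(column).most_common(1)[0]
        if count / len(column) >= threshold:
            consensus.append(base)
        else:
            consensus.append("N")  # ambiguous column: no confident correction
    return "".join(consensus)

# Five reads with the same UMI; the error at position 3 of the last read is
# outvoted (4/5 = 0.8 meets the threshold), so the consensus restores the
# true base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGAACGT"]
print(consensus_correct(reads))  # ACGTACGT
```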
