CARE 2.0: reducing false-positive sequencing error corrections using machine learning

Felix Kallenborn, Julian Cascitti, Bertil Schmidt

doi:10.1186/s12859-022-04754-3

Open Access

Journal: BMC Bioinformatics	Publication Date: Jun 13, 2022
Citations: 9	License type: open-access

Affiliation: Johannes Gutenberg University Mainz

Abstract

Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.

Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.

Conclusion: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced, which improves k-mer analysis and de-novo assembly on real-world datasets and demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.
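
To put the TP and FP counts quoted in the Results into perspective, the following is a minimal sketch that turns them into correction precision, TP / (TP + FP). The function and variable names are illustrative assumptions and are not part of CARE; only the four counts come from the abstract. Because the abstract gives bounds for the other tools (at least 3.9M FPs, at most 814.5M TPs) that may stem from different programs, the second value is an upper bound on their precision, not a measured figure.

```python
def precision(tp: float, fp: float) -> float:
    """Fraction of applied corrections that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)


# Counts in millions, as quoted in the abstract for the simulated
# human dataset with 914M reads.
care_tp, care_fp = 801.4, 1.2

# Bounds quoted for the other tools: at most 814.5M TPs, at least 3.9M FPs.
other_tp_max, other_fp_min = 814.5, 3.9

care_precision = precision(care_tp, care_fp)

# Precision grows with TP and shrinks with FP, so pairing the largest TP
# count with the smallest FP count yields an upper bound for the other
# tools, even if the two numbers belong to different programs.
other_precision_bound = precision(other_tp_max, other_fp_min)

# Lower bound on how many times more FPs the other tools produce.
fp_ratio = other_fp_min / care_fp

print(f"CARE 2.0 precision:            {care_precision:.4%}")
print(f"Other tools, precision bound:  {other_precision_bound:.4%}")
print(f"FP ratio (others / CARE 2.0):  {fp_ratio:.2f}x or more")
```

On these numbers, CARE 2.0 reaches a precision of about 99.85%, the other tools stay at or below about 99.52%, and they produce at least 3.25 times as many false positives on this dataset.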
