Comparison of error correction algorithms for Ion Torrent PGM data: application to hepatitis B virus

Liting Song, Juan Kang, Wenxun Huang, Yuan Huang, Hong Ren, Keyue Ding

doi:10.1038/s41598-017-08139-y

Open Access

Abstract

Ion Torrent Personal Genome Machine (PGM) technology is a mid-length read, low-cost and high-speed next-generation sequencing platform with a relatively high insertion and deletion (indel) error rate. A full systematic assessment of the effectiveness of various error correction algorithms in PGM viral datasets (e.g., hepatitis B virus (HBV)) has not been performed. We examined 19 quality-trimmed PGM datasets for the HBV reverse transcriptase (RT) region and found a total error rate of 0.48% ± 0.12%. Deletion errors were clearly present at the ends of homopolymer runs. Tests using both real and simulated data showed that the algorithms differed in their abilities to detect and correct errors and that the error rate and sequencing depth significantly affected the performance. Of the algorithms tested, Pollux showed a better overall performance but tended to over-correct ‘genuine’ substitution variants, whereas Fiona proved to be better at distinguishing these variants from sequencing errors. We found that the combined use of Pollux and Fiona gave the best results when error-correcting Ion Torrent PGM viral data.
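To make the error categories above concrete, the following is a minimal sketch of how substitution, insertion, and deletion errors might be tallied from one gapped pairwise alignment, and how homopolymer runs in a reference could be located. It is not the authors' pipeline: the function names (`tally_errors`, `error_rate`, `homopolymer_runs`), the use of `-` as the gap character, the per-alignment-column definition of the rate, and the minimum run length of 3 are all assumptions made for illustration.

```python
from itertools import groupby


def tally_errors(ref_aln, read_aln):
    """Count error types column by column in a gapped pairwise alignment.

    Assumes both strings have equal length and use '-' for gaps
    (an illustrative convention, not taken from the paper).
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" and read_base == "-":
            continue  # uninformative column
        if ref_base == "-":
            counts["insertion"] += 1   # base present only in the read
        elif read_base == "-":
            counts["deletion"] += 1    # base missing from the read
        elif ref_base != read_base:
            counts["substitution"] += 1
    return counts


def error_rate(counts, n_columns):
    """Total errors per alignment column (one possible definition)."""
    return sum(counts.values()) / n_columns


def homopolymer_runs(seq, min_len=3):
    """Return (start, length, base) for each run of identical bases."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, length, base))
        pos += length
    return runs


if __name__ == "__main__":
    ref_aln = "ACGTTTTACG-A"
    read_aln = "ACGTTT-ATGCA"
    counts = tally_errors(ref_aln, read_aln)
    print(counts, error_rate(counts, len(ref_aln)))
    print(homopolymer_runs(ref_aln.replace("-", "")))
```

In this toy alignment the single deletion falls on the last base of the `TTTT` run, which is the kind of position the abstract describes. A real analysis would aggregate such counts over all reads in a dataset.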

Highlights

Several algorithms have been proposed to correct sequencing errors for Personal Genome Machine (PGM) data (Table 1).
Ultra-deep sequencing has been widely used for analyses of viral populations[32, 33] and enables the examination of the diversity of the whole viral population and the identification of important variants present within the viral population at low frequencies.
Bragg et al.[26] described the biases and errors introduced by PGM across a combination of factors in two bacterial species.

Summary

Introduction

Several algorithms have been proposed to correct sequencing errors for PGM data (Table 1). These algorithms differ in their error models, statistical techniques, data features, parameters, and performance. Sequencing data generated on NGS platforms for four microbial genomes have been analyzed to assess coverage distribution, bias, GC distribution, variant detection and accuracy[13]. However, the error correction algorithms have not been fully assessed on, or applied to, viral sequencing data (e.g., hepatitis B virus, HBV). Because the HBV reverse transcriptase (RT) lacks proofreading, errors in HBV DNA replication occur at a higher rate than in other DNA viruses, with an estimated nucleotide substitution rate of …

Methods

Results

Conclusion