Abstract

Automated DNA sequencers generate chromatograms that contain raw sequencing data. They also generate data that translates the chromatograms into molecular sequences of A, C, G, T, or N (undetermined) characters. Since chromatogram translation programs frequently introduce errors, a manual inspection of the generated sequence data is required. As sequence numbers and lengths increase, visual inspection and manual correction of chromatograms and corresponding sequences on a per-peak and per-nucleotide basis become error-prone, time-consuming, and tedious. Here, we introduce ChromatoGate (CG), an open-source software tool that accelerates and partially automates the inspection of chromatograms and the detection of sequencing errors for bidirectional sequencing runs. To give users full control over the error correction process, we deliberately did not implement a fully automated error correction algorithm. Initially, the program scans a given multiple sequence alignment (MSA) for potential sequencing errors, assuming that each polymorphic site in the alignment may be attributed to a sequencing error with a certain probability. The guided MSA assembly procedure in ChromatoGate detects chromatogram peaks of all characters in an alignment that lead to polymorphic sites, given a user-defined threshold. The threshold value represents the sensitivity of the sequencing error detection mechanism. After this pre-filtering, the user only needs to inspect a small number of peaks in every chromatogram to correct sequencing errors. Finally, we show that correcting sequencing errors is important, because population genetic and phylogenetic inferences can be misled by MSAs with uncorrected mis-calls. Our experiments indicate that estimates of population mutation rates can be affected two- to three-fold by uncorrected errors.
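The pre-filtering idea described above can be illustrated with a minimal sketch. It is not ChromatoGate's implementation: the function name `flag_suspect_sites`, the dictionary-based MSA representation, and the interpretation of the threshold as "a character is suspicious if it occurs at most `threshold` times in a polymorphic column" are all assumptions made for illustration. The abstract does not specify how the threshold is defined.

```python
from collections import Counter


def flag_suspect_sites(msa, threshold=1):
    """Return (site, sequence name, base) triples worth checking by eye.

    Hypothetical helper, not part of ChromatoGate. A column is polymorphic
    if it holds more than one distinct character. A character is flagged
    when it occurs at most `threshold` times in its column, which is an
    ASSUMED reading of the user-defined sensitivity threshold.
    """
    names = list(msa)
    length = len(msa[names[0]])
    if any(len(seq) != length for seq in msa.values()):
        raise ValueError("all aligned sequences must have the same length")

    flagged = []
    for col in range(length):
        counts = Counter(msa[name][col] for name in names)
        if len(counts) < 2:
            continue  # monomorphic column: nothing to inspect
        for name in names:
            base = msa[name][col]
            if counts[base] <= threshold:
                flagged.append((col + 1, name, base))  # 1-based site index
    return flagged


# Toy alignment (made up for this example)
msa = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGT",
    "seq3": "ACCTACGT",
    "seq4": "ACGTACGA",
}
for site, name, base in flag_suspect_sites(msa, threshold=1):
    print(site, name, base)
```

Under these assumptions, only the rare characters at polymorphic sites are reported, so the user would look at two chromatogram peaks instead of all 32 positions in the toy alignment. Raising `threshold` makes the filter more sensitive and flags more peaks.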

Highlights

  • Genomic sequence analysis is an important task in bioinformatics

  • We propose an approach for systematic detection and correction of sequencing errors in multiple sequence alignment (MSA) that relies on chromatogram data, denoted as the “CGF framework”

  • The Ambiguous Character Detection (ACD) entry indicates that the ambiguous character was found at site 155 of sequence seqX

  • [...] significantly along a genomic segment [21], we simplified the error simulation process by assuming that base mis-calls are distributed in the MSA

Affiliations: (a) Scientific Computing Group, HITS gGmbH, Heidelberg, Germany; (b) Institute of Marine Biology and Genetics, HCMR, Heraklion Crete, Greece

Introduction

Genomic sequence analysis is an important task in bioinformatics and computational biology. Several applications, such as phylogenetic tree reconstruction or inference of population genetic parameters, rely on genomic sequence data. Phylogenetic studies can be used to determine how a virus spreads over the globe [1] or to describe major shifts in the diversification rates of plants [2]. Population genetics can [...] population genetic parameters). The quality of the initial MSA is of primary importance. Only partially different or slightly erroneous MSAs can yield substantially different parameter values. Alignment errors can mislead the branch-site test [3] for positive selection such that it returns unacceptably high false positives [4].
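To make concrete how a slightly erroneous MSA can shift a population genetic parameter, the sketch below computes Watterson's estimator of the population mutation rate, θ_W = S / a_n, where S is the number of segregating (polymorphic) sites and a_n = Σ_{i=1}^{n-1} 1/i for n sequences. Choosing Watterson's estimator is an assumption made for illustration; the text above does not state which estimator the authors used. The sequences are made up, and the function names are hypothetical.

```python
def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct character."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator: S divided by the (n-1)-th harmonic number."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


# Toy data (invented): one true polymorphism at site 3.
clean = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACCTACGTAC"]
# Same alignment with two simulated base mis-calls (sites 5 and 10).
noisy = ["ACGTACGTAC", "ACGTTCGTAC", "ACGTACGTAA", "ACCTACGTAC"]

print(watterson_theta(clean))
print(watterson_theta(noisy))
```

Because each uncorrected mis-call at an otherwise monomorphic site adds one segregating site, the estimate grows linearly with the number of such errors. In this toy case two mis-calls triple S, and therefore triple the estimate; this is a property of the constructed example, not a reproduction of the authors' experiments.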

