Abstract

Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short-read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short-read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads by filtering out mismatched reads that remain in alignments after local realignment and error correction of mismatched reads. The error correction is based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and to experimentally obtained short-read data from rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where whole-genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.
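The filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not Coval's implementation: the `AlignedRead` type, the function names, and the `max_mismatches` threshold of 2 are all hypothetical, the alignments are assumed to be ungapped, and the local realignment and error-correction steps that precede filtering are omitted.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    name: str
    pos: int   # 0-based start of the alignment on the reference
    seq: str   # read bases; an ungapped alignment is assumed


def count_mismatches(read: AlignedRead, reference: str) -> int:
    """Count positions where the read differs from the reference window."""
    window = reference[read.pos:read.pos + len(read.seq)]
    return sum(1 for base, ref_base in zip(read.seq, window) if base != ref_base)


def filter_reads(reads, reference, max_mismatches=2):
    """Keep only reads with at most max_mismatches mismatches (assumed threshold)."""
    return [r for r in reads if count_mismatches(r, reference) <= max_mismatches]


# Toy data, invented for illustration only.
reference = "ACGTACGTACGTACGTACGT"
reads = [
    AlignedRead("r1", 0, "ACGTACGT"),  # perfect match
    AlignedRead("r2", 4, "ACGAACGT"),  # one mismatch, tolerated
    AlignedRead("r3", 8, "TTTTACGA"),  # four mismatches, likely spurious
]

kept = filter_reads(reads, reference)
print([r.name for r in kept])
```

In this toy setting the high-mismatch read `r3` is dropped while `r1` and `r2` are retained; a real tool would also have to handle indels, soft clipping, and paired-end information.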

Highlights

  • Next-generation sequencing (NGS) technology has enabled us to determine whole genome sequences and structures rapidly and inexpensively, including DNA polymorphisms, gene structures, and epigenetic alterations, by producing massive amounts of short reads

  • The clustered high-mismatch reads observed in alignments with Illumina reads are likely due to sequence-specific errors stemming from the Illumina sequencing system [2]

  • Because sequence reads containing substantial numbers of errors introduced through this mechanism cannot be aligned to the reference sequence, the error-prone genomic regions tend to have lower read coverage and to contain high-mismatch reads that are still tolerated by the alignment


Summary

Introduction

By producing massive amounts of short reads, next-generation sequencing (NGS) technology has made it possible to determine whole-genome sequences and structures, including DNA polymorphisms, gene structures, and epigenetic alterations, rapidly and inexpensively. Nakamura et al. have recently reported that GGC-containing genomic regions are prone to sequence-specific errors in Illumina sequencing reactions [2]. In addition, the short length (36–110 bp) of the sequence reads often leads to misalignment of the reads to unrelated positions in a reference genome. This is particularly problematic in organisms whose genomes contain a large proportion of repetitive sequences. These problems all hinder the accurate determination of genomic structures, including DNA polymorphisms, through the alignment of NGS short reads with a reference genome.
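The effect of read length on placement in repetitive sequence can be seen in a small Python sketch. It uses exact string matching as a stand-in for alignment, which is a simplification (real aligners tolerate mismatches and gaps), and the toy reference and the helper `find_exact_hits` are invented for illustration.

```python
def find_exact_hits(read: str, reference: str) -> list[int]:
    """Return every 0-based start position where the read matches exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Resume one base further along so overlapping hits are also found.
        start = reference.find(read, start + 1)
    return hits


# Toy reference with the same repeat unit at two loci (hypothetical sequence).
repeat = "GATTACAGG"
reference = "TTC" + repeat + "CCATA" + repeat + "TTCA"

short_read = "ATTACA"      # lies entirely within the repeat unit
long_read = "TTCGATTACA"   # extends into the unique flank of the first copy

print(find_exact_hits(short_read, reference))  # two candidate positions
print(find_exact_hits(long_read, reference))   # one candidate position
```

The short read matches both copies of the repeat and so cannot be placed unambiguously, whereas the longer read is anchored by unique flanking sequence. The more of a genome that consists of such repeats, the more often short reads end up in this ambiguous situation.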
