Abstract

During the last few years, DNA and RNA sequencing have come to play an increasingly important role in biological and medical applications, mainly because new sequencing machines yield far more data and sequencing costs have dropped enormously. Illumina/Solexa sequencing in particular has had a growing impact on data collection from model and non-model organisms. However, accurate and easy-to-use tools for quality filtering have not yet been established. We present ConDeTri, a method for content-dependent read trimming of next-generation sequencing data that uses the quality score of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized. A further aim is to incorporate read trimming into next-generation sequencing data processing and analysis pipelines. ConDeTri can process single-end and paired-end sequence data of arbitrary length, and it is independent of sequencing coverage and requires no user interaction. It trims and removes reads with low quality scores, which saves computational time and memory during de novo assemblies. Low-coverage and large-genome sequencing projects will especially gain from trimming reads. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data.

Availability and implementation: Freely available on the web at http://code.google.com/p/condetri.

Highlights

  • Data set SRR063698 showed much higher Illumina quality scores than data set SRR063699, which makes the two data sets especially useful for testing the influence of trimming on ‘good’ and ‘bad’ sequencing data

  • We used data from the collared flycatcher (Ficedula albicollis) genome sequencing project to test the performance of ConDeTri on a non-model species for which no genome sequence is available

  • We found one single nucleotide polymorphism (SNP) every 1,299 bp in the untrimmed data and one every 1,450 bp in the trimmed data sets, regardless of which trimming method was applied, which was consistent with our predictions

Introduction

Since Sanger sequencing [1] was introduced, many genomes have been sequenced, including large eukaryotic genomes such as human, mouse and chicken. Several next-generation sequencing (NGS) methods have been released and established in the biological and medical sciences. NGS techniques differ from traditional Sanger sequencing in, among other things, the error probabilities of each read. For Illumina sequencing, the probability of sequencing errors increases exponentially from the 5′ to the 3′ end of a sequencing read [4]. Read accuracy is crucial to consider when using NGS data because it affects the assembly and mapping process, as well as downstream applications such as single nucleotide polymorphism (SNP) discovery and copy number variation (CNV) identification.
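To make the link between per-base quality scores and error probabilities concrete, the following minimal Python sketch uses the standard Phred relation P = 10^(-Q/10) and removes low-quality bases from the 3′ end of a read. It is an illustration only and not the ConDeTri algorithm: the function names, the quality threshold of 20 and the default ASCII offset of 33 are assumptions made for this example (older Illumina pipelines used an offset of 64).

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def decode_qualities(qual_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption; pass offset=64 for older Illumina data.
    """
    return [ord(c) - offset for c in qual_string]


def trim_3prime(seq, qual_string, min_q=20, offset=33):
    """Drop bases from the 3' end until a base with quality >= min_q is found.

    A deliberately simple scheme for illustration (hypothetical helper,
    not the method described in the paper).
    """
    quals = decode_qualities(qual_string, offset)
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], qual_string[:end]


if __name__ == "__main__":
    seq, qual = "ACGTAC", "IIII#!"
    print(decode_qualities(qual))
    print(trim_3prime(seq, qual))
```

In this toy read the last two bases have Phred scores below the threshold, so they are removed while the four high-quality bases are kept. A read whose bases all fall below the threshold is trimmed to an empty string and could then be discarded.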
