Abstract

Background: The single molecule, real time (SMRT) sequencing technology of Pacific Biosciences enables the acquisition of transcripts from end to end due to its ability to produce extraordinarily long reads (>10 kb). This new method of transcriptome sequencing has been applied to several projects on humans and model organisms. However, the raw data from SMRT sequencing are of relatively low quality, with a random error rate of approximately 15%, so error correction using next-generation sequencing (NGS) short reads is typically necessary. Few tools have been designed that apply a hybrid sequencing approach combining NGS and SMRT data, and the most popular existing tool for error correction, LSC, has computing resource requirements that are too intensive for most laboratories and research groups. These shortcomings severely limit the application of SMRT long reads for transcriptome analysis.

Results: Here, we report LSCplus, an improved tool for error correction developed with the LSC program as a reference. LSCplus overcomes LSC's long running time and improves quality. LSCplus requires only 1/3–1/4 of the time and 1/20–1/25 of the error correction time required by LSC.

Conclusions: LSCplus is freely available at http://www.herbbol.org:8001/lscplus/. Sample calculations are provided illustrating the precision and efficiency of this method regarding error correction and isoform detection.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1316-y) contains supplementary material, which is available to authorized users.

Highlights

  • The single molecule, real time (SMRT) sequencing technology of Pacific Biosciences enables the acquisition of transcripts from end to end due to its ability to produce extraordinarily long reads (>10 kb)

  • Many users have reported that LSC has computing resource requirements that are too intensive for most laboratory and research groups, which severely limits the application of SMRT long reads for transcriptome analysis

  • We addressed some of the shortcomings of LSC to improve this method


Summary

Results

Quality and time consumption

We evaluated the correction efficiency of LSCplus (v2.25) against that of the existing pipelines LSC (v1_beta and v2), PacBioToCA (v8.1, hybrid correction) and LoRDEC (v0.5) using two real biological datasets, one library of long PacBio reads and one library of RNA-seq short reads: (1) human brain cerebellum polyA RNA processed to enrich for full-length cDNA for the PacBio RS platform under C2 chemistry conditions, used as LR data [20] (174,246 PacBio long reads, http://www.healthcare.uiowa.edu/labs/au/LSC/files/human_cerebellum_PacBioLR.zip), and (2) human brain data from Illumina's Human Body Map 2.0 project (GSE30611, 64,313,204 single-end reads, 75 bp), used as SR data. After running each pipeline on these datasets, we collected the output and summarized the results.
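To give a sense of scale for the evaluation data and the reported time savings, the figures quoted in this document can be turned into a few derived quantities. The following is only a back-of-the-envelope sketch: the read counts, read length and time fractions come from the text, while the function names are hypothetical and the calculation is not part of LSCplus or any of the compared pipelines.

```python
from fractions import Fraction

# Figures quoted in the text.
LONG_READS = 174_246       # PacBio long reads (LR data)
SHORT_READS = 64_313_204   # Illumina single-end reads (SR data)
SHORT_READ_LEN = 75        # bp per short read


def short_read_bases(n_reads: int, read_len: int) -> int:
    """Total number of bases in a fixed-length short-read library."""
    return n_reads * read_len


def speedup(time_fraction: Fraction) -> Fraction:
    """Convert 'takes 1/k of the time' into a k-fold speedup."""
    return 1 / time_fraction


if __name__ == "__main__":
    total_bp = short_read_bases(SHORT_READS, SHORT_READ_LEN)
    print(f"SR library size: {total_bp:,} bp")
    print(f"Short reads per long read: {SHORT_READS / LONG_READS:.1f}")
    # Time fractions reported in the abstract for LSCplus relative to LSC.
    for label, frac in [("overall, lower bound", Fraction(1, 3)),
                        ("overall, upper bound", Fraction(1, 4)),
                        ("error correction, lower bound", Fraction(1, 20)),
                        ("error correction, upper bound", Fraction(1, 25))]:
        print(f"{label}: about {speedup(frac)}x faster")
```

Read this way, the SR library amounts to roughly 4.8 Gb of sequence, and the reported fractions correspond to a 3- to 4-fold reduction in the time and a 20- to 25-fold reduction in the error correction time.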
