Subset selection of high-depth next generation sequencing reads for de novo genome assembly using MapReduce framework.

Chih-Hao Fang, Yu-Jung Chang, Ping-Heng Hsieh, Chung-Yen Lin, Wei-Chun Chung, Jan-Ming Ho

doi:10.1186/1471-2164-16-s12-s9

Abstract

Background: Recent progress in next-generation sequencing technology has afforded several improvements such as ultra-high throughput at low cost, very high read quality, and substantially increased sequencing depth. State-of-the-art high-throughput sequencers, such as the Illumina MiSeq system, can generate ~15 Gbp of sequencing data per run, with >80% of bases above Q30 and a sequencing depth of up to several 1000x for small genomes. The Illumina HiSeq 2500 is capable of generating up to 1 Tbp per run, with >80% of bases above Q30 and often >100x sequencing depth for large genomes. To speed up otherwise time-consuming genome assembly and/or to quickly obtain a skeleton of the assembly for scaffolding or progressive assembly, methods that remove noise and reduce redundancy in the original data, while giving almost equal or better assembly results, are worth studying.

Results: We developed two subset selection methods for single-end reads and a method for paired-end reads, based on base quality scores and other read analytic tools, using the MapReduce framework. We proposed two strategies to select reads: MinimalQ and ProductQ. MinimalQ selects reads whose minimum base quality is above a threshold. ProductQ selects reads whose probability of containing no incorrect base is above a threshold. In the single-end experiments, we used Escherichia coli and Bacillus cereus datasets from MiSeq, the Velvet assembler for genome assembly, and the GAGE benchmark tools for result evaluation. In the paired-end experiments, we used the giant grouper (Epinephelus lanceolatus) dataset from HiSeq, the ALLPATHS-LG genome assembler, and the QUAST quality assessment tool to compare genome assemblies of the original set and the subset. The results show that subset selection can not only speed up genome assembly but also produce substantially longer scaffolds.

Availability: The software is freely available at https://github.com/moneycat/QReadSelector.
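The two single-end selection rules described in the abstract can be illustrated with a short sketch. This is not the authors' QReadSelector code and does not use MapReduce; it is a minimal single-machine illustration under stated assumptions: quality strings use Phred+33 encoding, base errors are treated as independent, "above a threshold" is read as a strict comparison, and the function names and the thresholds (30 and 0.99) are hypothetical choices made for this example.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 FASTQ encoding (Illumina 1.8+)


def phred_scores(quality_string):
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def min_quality(quality_string):
    """MinimalQ-style score: the lowest Phred score in the read."""
    return min(phred_scores(quality_string))


def prob_no_error(quality_string):
    """ProductQ-style score: probability that every base call is correct.

    A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
    Errors at different positions are assumed to be independent.
    """
    prob = 1.0
    for q in phred_scores(quality_string):
        prob *= 1.0 - 10 ** (-q / 10)
    return prob


def select_reads(records, score, threshold):
    """Keep (name, sequence, quality) records whose score exceeds threshold."""
    return [rec for rec in records if rec[2] and score(rec[2]) > threshold]


# Toy input; the thresholds below are illustrative, not taken from the paper.
reads = [
    ("read1", "ACGT", "IIII"),  # every base at Q40
    ("read2", "ACGT", "II#I"),  # one base at Q2
]
minimal_q_subset = select_reads(reads, min_quality, 30)
product_q_subset = select_reads(reads, prob_no_error, 0.99)
print([name for name, _, _ in minimal_q_subset])
print([name for name, _, _ in product_q_subset])
```

In this toy input, a single low-quality base is enough to exclude read2 under both rules. The two rules differ on longer reads: a minimum-based score ignores read length, whereas a product-based score decreases as more bases are multiplied in.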

Highlights

Recent progress in next-generation sequencing technology has afforded several improvements such as ultra-high throughput at low cost, very high read quality, and substantially increased sequencing depth.
Subset selection for single-end reads: here, we propose two strategies to select a subset of reads based on the quality value of each base.
We proposed the subset selection problem of high-depth reads for de novo genome assembly and developed two selection strategies, MinimalQ and ProductQ, to select subsets of reads and paired ends.

Summary

Introduction

Recent progress in next-generation sequencing technology has afforded several improvements such as ultra-high throughput at low cost, very high read quality, and substantially increased sequencing depth. State-of-the-art high-throughput sequencers, such as the Illumina MiSeq system, can generate ~15 Gbp of sequencing data per run, with >80% of bases above Q30 and a sequencing depth of up to several 1000x for small genomes. Producing longer contigs and scaffolds requires sequencing data with sufficient sequencing depth; the depth of HiSeq data is often 100x or more for large genomes. The availability of such high sequencing depth and high-quality reads leads us to wonder whether it is possible to select useful reads and read pairs from the original sequencing data, in order to assemble genomes without affecting assembly results, or even with better results.
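The depth figures quoted above follow from dividing the sequencing yield by the genome size. The short calculation below checks the "several 1000x" figure for a small genome. It is only a back-of-the-envelope sketch: the genome size of about 4.6 Mbp for Escherichia coli is an assumption not stated in this text, and the calculation assumes the whole MiSeq run is devoted to one sample.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome length."""
    return total_bases / genome_size


miseq_yield_bp = 15e9    # ~15 Gbp per MiSeq run, as quoted in the text
ecoli_genome_bp = 4.6e6  # assumption: approximate E. coli genome size

depth = sequencing_depth(miseq_yield_bp, ecoli_genome_bp)
print(f"Approximate depth: {depth:.0f}x")
```

Under these assumptions the depth comes out at a little over 3000x, which is consistent with the range quoted for small genomes.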

Journal: BMC Genomics
Publication Date: Dec 1, 2015
Citations: 14
License type: CC BY
