Abstract

Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Owing to the lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, no report is currently available highlighting the impact of high sequence depth on genome assembly using real datasets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rate, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly of genomes of different sizes using graph-based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 MB), S. kudriavzevii (11.18 MB) and C. elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM, depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
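To make the notion of read depth concrete, the sketch below applies the standard coverage relation, depth = (number of reads × read length) / genome size, to the three genomes named in the abstract. It is an illustration only and not part of the study: the function name `reads_for_depth` is hypothetical, the 100 bp read length is an assumed value that the abstract does not state, and the genome sizes are read as megabases.

```python
import math


def reads_for_depth(genome_size_bp: int, depth: float, read_length_bp: int) -> int:
    """Smallest number of reads such that reads * read_length / genome_size >= depth."""
    return math.ceil(depth * genome_size_bp / read_length_bp)


# Genome sizes from the abstract, interpreted as megabases (an assumption).
GENOMES_BP = {
    "E. coli": 4_600_000,
    "S. kudriavzevii": 11_180_000,
    "C. elegans": 100_000_000,
}

# Assumed read length; the abstract does not give one.
READ_LENGTH_BP = 100

if __name__ == "__main__":
    for name, size_bp in GENOMES_BP.items():
        for depth in (50, 100):
            n_reads = reads_for_depth(size_bp, depth, READ_LENGTH_BP)
            print(f"{name}: {depth}X needs about {n_reads:,} reads")
```

Under these assumptions, 50X corresponds to roughly 2.3 million reads for E. coli and 50 million reads for C. elegans; a different read length scales these counts inversely.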

Highlights

  • Sanger sequencing was used to sequence the genomes of organisms of interest

  • There has been an increase in the number of de novo genome assemblies generated using Next Generation Sequencing (NGS) data [8;27] as well as the number of assemblers available to assemble this data [27;28;32]

  • All of the short read assemblers are based on the de Bruijn graph approach, and a number of studies evaluating the performance of these assembly algorithms for genomes of different sizes have been published in recent years [22;23;24]


Summary

Introduction

Sanger sequencing was used to sequence the genomes of organisms of interest. Using Sanger sequencing technology, the human genome was sequenced at 6–8X average coverage, at a cost of about $2.7 billion, and required the efforts of over 3000 scientists from 6 different countries [3]. The complexity, cost and time involved in the human genome project highlighted the dire need for sequencers with higher throughput and lower cost of sequencing. This need culminated in the development of multiple high-throughput or massively parallel sequencing technologies, collectively referred to as Next Generation Sequencing (NGS) technologies. The cost of sequencing on NGS systems is much lower than that of the automated Sanger sequencing method. According to data released by the National Human Genome Research Institute, the cost of sequencing a human-sized genome using NGS technology is a little less than $10,000, and this includes library preparation, sequencing and data analysis (http://www.genome.gov/sequencingcosts/).

