Abstract

A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and result in reduction of the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and we comment on some findings regarding suitable computational resources for assembly.

Highlights

  • Genome assembly is a fundamental problem in sequence bioinformatics [1] and many assemblers have been developed up to now [2,3,4,5,6,7,8,9,10,11,12,13]

  • Current Next-Generation Sequencing (NGS) technologies deliver the following significant improvements over older methods [14]: (i) the read length has increased to several hundreds or even thousands of base pairs for single-molecule, real-time sequencing; (ii) genome coverage has increased by orders of magnitude; (iii) the sequencing process has become much faster and much cheaper [15]; (iv) whole genome sequencing (WGS) for every organism has become feasible [16]; (v) metagenomics assembly from environmental samples has become possible [17]

  • Our results show that DiMA is a general strategy for reducing the memory requirements of traditional assemblers

Read more

Summary

Introduction

Genome assembly is a fundamental problem in sequence bioinformatics [1] and many assemblers have been developed up to now [2,3,4,5,6,7,8,9,10,11,12,13]. The input for genome assembly is generated using the Next-Generation Sequencing (NGS) technologies. A side effect of NGS is the massive amount of generated raw data that normally requires computers with very large memories for the assembly process. Traditional short-read assemblers require around 256 GB RAM for datasets with roughly 500 million reads [18]. This problem is expected to worsen in the future because the NGS data generation rate has exceeded expectations based on Moore’s law [19], meaning that the amount of raw data is expected to grow much faster than the capacity of available memory. Despite the practical significance of the problem, existing reviews [1,20,21,22] and comparison studies like Assemblathon [23] and GAGE [18], have focused on the quality of the assembly, but not on memory requirements

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.