Abstract

DNA sequencing technology has evolved rapidly and now produces a large and fast-growing number of short reads. This has led to a resurgence of research into whole genome shotgun assembly algorithms. Our assembly algorithm starts by clustering the short reads in a cloud computing framework; the clustering groups fragments by their similarity with respect to the original consensus long sequence. We condense each group of reads into a chain of seeds, where a seed is a kind of substring with reads aligned to it, and then build a graph from these chains. Finally, we analyze the graph to find Euler paths, assemble the reads along each path into contigs, and lay out the contigs into scaffolds using mate-pair information. The results show that our algorithm is efficient and feasible for large read sets such as those produced by next-generation sequencing technology.

Highlights

  • The introduction of the massively parallel next-generation sequencing (NGS) technologies has caused a great increase in the number of reads typically generated by experiments

  • The whole genome shotgun (WGS) de novo assembly problem is the reconstruction of the genetic sequence information from a set of reads sequenced from the fragments

  • In this paper we present methods and implementation techniques for a new clustering-based, graph-conducted assembler, named SeedsGraph, which is efficient and takes advantage of cloud computing for large NGS datasets


Summary

Introduction

The introduction of the massively parallel next-generation sequencing (NGS) technologies has caused a great increase in the number of reads typically generated by experiments. The shorter read length of NGS and the sheer demand for more scalable assemblers pose an important computational challenge, and genome assembly remains one of the most difficult and important algorithmic problems in bioinformatics. Software technology and algorithm implementation become critical factors when dealing with terabytes of data. Cloud computing, as a new way of handling extremely large datasets, offers a good opportunity for bioinformatics data processing. Its capability and feasibility for such applications have been discussed [1,2]. We design a graph-based method for the NGS read assembly problem and implement it as a software package, SeedsGraph. The Background section discusses the NGS read assembly problem and the cloud computing framework.
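The Euler-path step of graph-based assembly can be illustrated with a small sketch. It is not the SeedsGraph implementation: it assumes a plain k-mer overlap (de Bruijn) graph as a stand-in for the seed graph, error-free reads, and a single strand. The function names `de_bruijn_edges`, `euler_path`, and `assemble` are hypothetical, and the path search uses the standard Hierholzer algorithm.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Yield one (prefix, suffix) edge per distinct k-mer found in the reads.

    Assumption: a k-mer graph stands in for the paper's seed graph.
    """
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                yield kmer[:-1], kmer[1:]


def euler_path(edges):
    """Return the node sequence of an Euler path (Hierholzer's algorithm).

    Assumes the edges form a connected graph that has an Euler path or cycle.
    """
    graph = defaultdict(list)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for u, v in edges:
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1

    # Start where out-degree exceeds in-degree by one; for an Euler
    # cycle no such node exists, so fall back to the first source node.
    nodes = set(out_deg) | set(in_deg)
    start = next((n for n in nodes if out_deg[n] - in_deg[n] == 1), None)
    if start is None:
        start = next(iter(graph))

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            # Follow an unused outgoing edge.
            stack.append(graph[node].pop())
        else:
            # Dead end: this node is final in the remaining path.
            path.append(stack.pop())
    path.reverse()
    return path


def assemble(reads, k):
    """Spell a contig by walking the Euler path of the k-mer graph."""
    path = euler_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    # Three overlapping toy reads from the sequence ATGCGTAC.
    print(assemble(["ATGCG", "GCGTA", "CGTAC"], 4))
```

With the three toy reads and k = 4, every (k-1)-mer occurs once, so the graph is a single chain and the walk spells out the original sequence. Real read sets contain repeats and sequencing errors, which is where clustering and mate-pair information become necessary.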

Background
9: Save T to HDFS for next job
Findings
Result