Abstract

Sequence assembly is an important step in bioinformatics studies. With the help of next generation sequencing (NGS) technology, high-throughput DNA fragments (reads) can be randomly sampled from DNA or RNA molecular sequences. However, because the positions from which the reads are sampled are unknown, an assembling process is required to combine overlapping reads and reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing methods, NGS has higher throughput, but its reads are shorter and have a higher error rate, which introduces several problems in assembling. Moreover, paired-end reads, which contain more information than single-end reads, can be sampled. Existing assemblers cannot fully utilize this information and fail to assemble longer contigs. In this article, we revisit the major problems of assembling NGS reads on genomic, transcriptomic, metagenomic and metatranscriptomic data. We also describe our IDBA package for solving these problems. The IDBA package adopts several novel ideas in assembling, including the use of multiple k values, local assembling and progressive depth removal. Compared with existing assemblers, IDBA performs better on many simulated and real sequencing datasets.

Highlights

  • Deoxyribonucleic acid (DNA) is a sequence of the nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T) that encodes the genetic information controlling the development and functioning of most organisms, with the exception of some viruses

  • The assembling step analyzes the set of fragments sampled from unknown locations and determines the original DNA or ribonucleic acid (RNA) sequence

  • We explain the general problems of assembling next generation sequencing (NGS) data (Section 2) and the possible solutions introduced by the IDBA package (Section 3)


Summary

Existing approaches

Since the location of each read in the DNA or RNA sequence is unknown, an assembling process is needed to combine these reads into the original sequence. There are three major approaches for assembling reads: (i) overlap-and-extend, (ii) string graph, and (iii) de Bruijn graph. Like the overlap-and-extend approach, the string graph approach requires a large amount of memory and takes a long time to find overlapping reads, because the data structure used for storing the string graph is large. When all reads are error-free and the number of sequenced reads is large compared with the genome length (high sequencing depth), both the string graph and de Bruijn graph approaches work well. Because of erroneous reads and repeated patterns in the genome, however, these two approaches may not perform well on some sequencing data.

An extension of IDBA for assembling prokaryotic metatranscriptomic data assembles reads by applying known protein reference sequences.
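To make the de Bruijn graph approach and its sensitivity to erroneous reads concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the IDBA implementation: vertices are k-mers, an edge links two k-mers that appear consecutively in some read, and reverse complements are ignored. The function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read.

    Simplification (assumption of this sketch): only the forward strand is used.
    """
    graph = defaultdict(set)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            graph[kmer]  # touch the key so vertices without successors are kept
            if previous is not None:
                graph[previous].add(kmer)
            previous = kmer
    return dict(graph)


def branching_vertices(graph):
    """Return the k-mers that have more than one successor, in sorted order."""
    return sorted(v for v, successors in graph.items() if len(successors) > 1)


# Two error-free reads sampled from the toy sequence ACGTTGCA.
clean_reads = ["ACGTTG", "GTTGCA"]
clean_graph = build_de_bruijn(clean_reads, k=3)

# The same reads plus one read with a single substitution error (C -> A).
noisy_graph = build_de_bruijn(clean_reads + ["GTTGAA"], k=3)

print(len(clean_graph), branching_vertices(clean_graph))
print(len(noisy_graph), branching_vertices(noisy_graph))
```

With the error-free reads the graph is a single simple path, so the original sequence can be read off directly. The one erroneous read adds two k-mers that do not occur in the original sequence and creates a branch at `TTG`, which is the kind of ambiguity an assembler has to resolve.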

False positive vertices
Gap problem
Branching problem
Under utilization of paired-end reads information
Solution for assembling NGS reads
Multiple k
Local assembling
Reads and contigs correction
Problems and solutions for assembling transcriptomic data
Solutions for assembling transcriptomic data
Problems of assembling metagenomic data
Solutions for assembling metagenomic data
Problems and solutions for assembling metatranscriptomic data
Problems of assembling metatranscriptomic data
Solutions for assembling metatranscriptomic data
Experimental results