Sequence assembly using next generation sequencing data--challenges and solutions.

Francis Y L Chin,S M Yiu,Henry C M Leung

doi:10.1007/s11427-014-4752-9


Open Access

https://doi.org/10.1007/s11427-014-4752-9


Journal: Science China Life Sciences
Publication Date: Oct 17, 2014
Citations: 15
License type: cc-by

Affiliation: University of Hong Kong

Abstract

Sequence assembly is an important step in bioinformatics studies. With next generation sequencing (NGS) technology, high-throughput DNA fragments (reads) can be randomly sampled from DNA or RNA molecules. However, because the positions from which the reads are sampled are unknown, an assembly process is required to combine overlapping reads and reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing, NGS offers higher throughput, but its reads are shorter and have a higher error rate, which introduces several problems in assembly. Moreover, paired-end reads, which carry more information than single-end reads, can be sampled, yet existing assemblers cannot fully utilize this information and fail to assemble longer contigs. In this article, we revisit the major problems of assembling NGS reads from genomic, transcriptomic, metagenomic and metatranscriptomic data. We also describe our IDBA package for solving these problems. The IDBA package adopts several novel ideas in assembly, including the use of multiple k values, local assembly and progressive depth removal. Compared with existing assemblers, IDBA performs better on many simulated and real sequencing datasets.

Highlights

Deoxyribonucleic acid (DNA) is a sequence of the nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T) that encodes the genetic information controlling the development and functioning of most organisms, with the exception of some viruses
The assembly step analyzes a set of fragments sampled from unknown locations and determines the original DNA or ribonucleic acid (RNA) sequence
We explain the general problems of assembling next generation sequencing (NGS) data (Section 2) and the possible solutions introduced by the IDBA package (Section 3)

Summary

Existing approaches

Since the location of each read in the DNA or RNA sequence is unknown, an assembly process is needed to combine these reads into the original sequence. There are three major approaches for assembling reads: (i) overlap-and-extend, (ii) string graph, and (iii) de Bruijn graph. As with the overlap-and-extend approach, the string graph approach requires a large amount of memory and takes a long time to find overlapping reads, because the data structure used for storing the string graph is large. When all reads are error-free and the number of sequenced reads is large compared with the genome length (high sequencing depth), both the string graph and de Bruijn graph approaches work well. Because of erroneous reads and repeated patterns in the genome, these two approaches may not perform well on some sequencing data. An extension of IDBA for assembling prokaryotic metatranscriptomic data assembles reads by applying known protein reference sequences.
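To make the de Bruijn graph approach more concrete, the following is a minimal sketch, not the IDBA implementation. It assumes one common convention in which each vertex is a k-mer and an edge joins two k-mers that occur consecutively in some read. The function names `build_de_bruijn` and `branching_nodes` and the toy reads are hypothetical and chosen only for illustration.

```python
def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph.

    Assumed convention: vertices are k-mers; an edge u -> v exists when
    u and v are consecutive k-mers in at least one read.
    Returns a dict mapping each k-mer to the set of its successors.
    """
    graph = {}
    for read in reads:
        # All k-mers of the read, in order of position.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        # Consecutive k-mers overlap by k - 1 characters.
        for u, v in zip(kmers, kmers[1:]):
            graph[u].add(v)
    return graph


def branching_nodes(graph):
    """Return the vertices with more than one outgoing edge, sorted."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Two error-free toy reads that overlap on "GTT".
clean = build_de_bruijn(["ACGTT", "GTTGC"], k=3)
print(sorted(clean))           # the k-mers that became vertices
print(branching_nodes(clean))  # no branches for these reads

# Adding a read with one substituted base (hypothetical error).
noisy = build_de_bruijn(["ACGTT", "GTTGC", "ACGAT"], k=3)
print(branching_nodes(noisy))  # the erroneous read introduces a branch
```

In this toy input the two error-free reads form a single unbranched path, while the read with a substituted base adds extra vertices and a branch. This is one simple way in which erroneous reads can complicate a de Bruijn graph.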

False positive vertices

Gap problem

Branching problem

Underutilization of paired-end read information

Solution for assembling NGS reads

Multiple k


Local assembling

Reads and contigs correction

Problems and solutions for assembling transcriptomic data

Solutions for assembling transcriptomic data

Problems of assembling metagenomic data

Solutions for assembling metagenomic data

Problems and solutions for assembling metatranscriptomic data

Problems of assembling metatranscriptomic data

Solutions for assembling metatranscriptomic data

Experimental results
