Graph Theoretical Strategies in De Novo Assembly

Kimia Behizadi, Ali Iranmanesh, Nafiseh Jafarzadeh

doi:10.1109/access.2022.3144113

Open Access

Abstract

De novo genome assemblers assume that the reference genome is unavailable, incomplete, highly fragmented, or significantly altered, as in cancer tissues. De novo assembly algorithms have been developed to assemble the large number of short sequence reads produced by genome sequencing. In this review paper, we provide an overview of the graph-theoretical side of de novo genome assembly algorithms. We investigate the construction of fourteen graph data structures related to OLC-based (overlap-layout-consensus) and DBG-based (de Bruijn graph) algorithms in order to compare and discuss their application in different assemblers. In addition, the most significant and recent de novo genome assemblers are classified according to the wide variety of original, generalized, and specialized versions of these graph data structures.

Highlights

Since the completion of the human genome project at the turn of the century, there has been an unprecedented expansion of genomic sequence data.
An estimated 43% of de novo assemblers for high-throughput sequencing (HTS) are based on the OLC approach and 57% are based on the de Bruijn graph (DBG) approach.
Overlap graph and string graph data structures reduce assembly to finding a Hamiltonian path, which is an NP-complete problem, but they are more suitable than de Bruijn graphs for long sequences and for single-molecule sequencing reads with high error rates.
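
To make the contrast between the two families concrete, the following is a minimal sketch, not taken from the paper, of how an overlap graph and a de Bruijn graph might be built from the same reads. The function names, the `min_len` overlap threshold, and the toy reads are illustrative assumptions; real assemblers use indexed overlap detection and error correction rather than the quadratic all-pairs comparison shown here.

```python
from collections import defaultdict


def overlap_length(a: str, b: str, min_len: int) -> int:
    """Longest suffix of `a` that equals a prefix of `b`; 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len):
    """OLC view: nodes are whole reads, edges are suffix-prefix overlaps.

    Returns a dict mapping (read_a, read_b) to the overlap length.
    Naive all-pairs comparison, for illustration only.
    """
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap_length(a, b, min_len)
                if n:
                    edges[(a, b)] = n
    return edges


def de_bruijn_graph(reads, k):
    """DBG view: nodes are (k-1)-mers, and every k-mer in a read adds one edge.

    Returns a dict mapping each (k-1)-mer prefix to the list of (k-1)-mer
    suffixes it is followed by (duplicates kept, so edge multiplicity is visible).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    reads = ["ATGGC", "GGCGT", "CGTGA"]  # hypothetical toy reads
    print(overlap_graph(reads, min_len=3))
    print(de_bruijn_graph(reads, k=3))
```

In the overlap graph, reconstructing the sequence means visiting every read (node) exactly once, which is the Hamiltonian path formulation mentioned above. In the de Bruijn graph, every k-mer is an edge, so reconstruction is usually posed as an Eulerian path that uses every edge once; the price is that reads are broken into k-mers, which is one reason long, error-prone reads tend to fit the overlap-based formulation better.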

Summary

Introduction

Since the completion of the human genome project at the turn of the century, there has been an unprecedented expansion of genomic sequence data. De novo genome assembly, the reconstruction of a genome from a collection of short sequencing reads without the aid of a reference genome, is one of the big-data challenges in bioinformatics. There are three generations of genome sequencing technologies. The first, Sanger sequencing, was developed in 1977 [1], [2]. It is a very expensive, low-throughput technique, but it was used to obtain the first human genome sequence. Second-generation sequencing, also called next-generation sequencing (NGS), marked the start of high-throughput sequencing (HTS), whose development and commercialization are revolutionizing genome sequencing. NGS technology can generate millions of short reads in parallel at a low sequencing cost and with a faster process compared with the
