OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees.

Song Gao, Burton K H Chia, Niranjan Nagarajan, Denis Bertrand

doi:10.1186/s13059-016-0951-y

Abstract

The assembly of large, repeat-rich eukaryotic genomes represents a significant challenge in genomics. While long-read technologies have made the high-quality assembly of small, microbial genomes increasingly feasible, data generation can be expensive for larger genomes. OPERA-LG is a scalable, exact algorithm for the scaffold assembly of large, repeat-rich genomes, out-performing state-of-the-art programs for scaffold correctness and contiguity. It provides a rigorous framework for scaffolding of repetitive sequences and a systematic approach for combining data from different second-generation and third-generation sequencing technologies. OPERA-LG provides an avenue for systematic augmentation and improvement of thousands of existing draft eukaryotic genome assemblies.

Electronic supplementary material: The online version of this article (doi:10.1186/s13059-016-0951-y) contains supplementary material, which is available to authorized users.

Highlights

Sequence assembly has been the subject of a significant amount of mathematical and algorithmic study [1,2,3,4].
Overview: The algorithmic core of OPERA-LG is adapted from the approach described in Gao et al. [13] and is based on (i) a memoized search to find a scaffold that minimizes the number of discordant read-derived links connecting contigs (Additional file 1: Figure S1a), (ii) a graph contraction technique that allows for localizing the search for an optimal scaffold without losing the guarantee of a globally optimal scaffold (Additional file 1: Figure S1b), and (iii) a quadratic programming formulation to compute gap sizes that best match mate-pair-derived distance constraints [27] (Additional file 1: Figure S1c).
To enable it to produce long and accurate scaffolds for large, repeat-rich genomes, OPERA-LG incorporates several novel features and improvements, including (a) optimized data structures to improve its scalability, (b) refined edge-length estimation and the ability to simultaneously use multiple libraries to improve scaffolding accuracy, and (c) extensions that allow for the scaffolding of repeat sequences.
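As a rough illustration of the gap-sizing step, the sketch below fits gap sizes to mate-pair-derived distance estimates by unconstrained weighted least squares, which is the simplest kind of quadratic objective. It is not OPERA-LG's actual formulation, which may include constraints and a different weighting. The function names, the `(i, j, distance, std)` link tuple, and the assumption that contigs are already ordered are all hypothetical choices made for this example.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the caller's data is left untouched.
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("gap sizes are not uniquely determined by the links")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def estimate_gaps(contig_lengths, links):
    """Least-squares gap sizes for contigs laid out in a fixed order.

    contig_lengths: contig lengths in scaffold order.
    links: (i, j, distance, std) tuples with i < j (hypothetical layout), where
           `distance` is the estimated span from the end of contig i to the
           start of contig j and `std` is its standard deviation.
    Returns len(contig_lengths) - 1 gap sizes, one per adjacent contig pair.
    """
    n = len(contig_lengths) - 1
    # Normal equations of: sum_k (1/std_k^2) * (sum of spanned gaps - target_k)^2
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i, j, distance, std in links:
        weight = 1.0 / (std * std)
        # Contigs strictly between i and j contribute a known length.
        target = distance - sum(contig_lengths[i + 1:j])
        for r in range(i, j):
            b[r] += weight * target
            for c in range(i, j):
                A[r][c] += weight
    return solve_linear_system(A, b)


gaps = estimate_gaps(
    [1000, 500, 800],
    [(0, 1, 100, 10), (1, 2, 200, 10), (0, 2, 800, 10)],
)
```

In this example the three links agree with each other, so the fit recovers gaps of about 100 and 200 bases. With conflicting links, the weights make estimates from tighter libraries count for more.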

Summary

Introduction

Sequence assembly has been the subject of a significant amount of mathematical and algorithmic study [1,2,3,4]. As there is a wide array of heuristics and parameter choices to try, the right combination that works well across a range of datasets may not always be apparent, and new assembly tools run the risk of being tuned for the datasets on which they are benchmarked. Recent assembly competitions such as GAGE [8], Assemblathon [9], Assemblathon2 [10], and a recent scaffolder benchmark [11] have played an important role in galvanizing the community and in highlighting the drawbacks of existing tools. Scaffold assembly is frequently formulated as a combinatorial graph problem, and this is the approach followed in this study.
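To make the graph formulation concrete, the following sketch scores a candidate scaffold by counting discordant links, the quantity the paper describes minimizing. It uses a deliberately simplified model that is an assumption of this example, not the paper's definition: gaps are taken as zero, each link carries only an upper distance bound, and the `Link` type and `count_discordant` function are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class Link(NamedTuple):
    """Claim that `first` on `first_strand` is followed by `second` on `second_strand`."""
    first: str
    first_strand: str   # '+' or '-'
    second: str
    second_strand: str  # '+' or '-'
    max_distance: int   # upper bound on the span between the two contigs


def flip(strand: str) -> str:
    return '-' if strand == '+' else '+'


def count_discordant(scaffold: List[Tuple[str, str]],
                     lengths: Dict[str, int],
                     links: List[Link]) -> int:
    """Count links violated by a scaffold given as (contig, strand) pairs, left to right."""
    placed = {}
    offset = 0
    for index, (contig, strand) in enumerate(scaffold):
        placed[contig] = (index, strand, offset)
        offset += lengths[contig]  # simplification: zero-length gaps

    discordant = 0
    for link in links:
        if link.first not in placed or link.second not in placed:
            continue  # link leaves this scaffold, so it is not scored here
        i, strand_i, start_i = placed[link.first]
        j, strand_j, start_j = placed[link.second]
        # A link is satisfied as stated, or by the reverse-complemented layout.
        forward = (i < j and strand_i == link.first_strand
                   and strand_j == link.second_strand)
        reverse = (j < i and strand_i == flip(link.first_strand)
                   and strand_j == flip(link.second_strand))
        if forward:
            distance = start_j - (start_i + lengths[link.first])
        elif reverse:
            distance = start_i - (start_j + lengths[link.second])
        else:
            discordant += 1  # wrong order or orientation
            continue
        if distance > link.max_distance:
            discordant += 1  # contigs are too far apart for this library
    return discordant


lengths = {"A": 1000, "B": 500, "C": 800}
scaffold = [("A", "+"), ("B", "+"), ("C", "+")]
links = [
    Link("A", "+", "B", "+", 300),  # adjacent, concordant
    Link("A", "+", "C", "+", 300),  # B lies between them, too far apart
    Link("C", "-", "B", "-", 300),  # same adjacency seen from the other strand
    Link("B", "+", "A", "+", 300),  # contradicts the order
]
score = count_discordant(scaffold, lengths, links)
```

A search procedure would compare this score across candidate orderings and orientations. The sketch only covers the scoring and says nothing about how OPERA-LG explores the search space.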

Methods

Results

Discussion

Conclusion