Assembly scaffolding with PE-contaminated mate-pair libraries.

Kristoffer Sahlin, Lars Arvestad, Rayan Chikhi

doi:10.1093/bioinformatics/btw064

Abstract

Scaffolding is often an essential step in a genome assembly process, in which contigs are ordered and oriented using read pairs from a combination of paired-end libraries and longer-range mate-pair libraries. Although a simple idea, scaffolding is unfortunately hard to get right in practice. One source of problems is so-called PE-contamination in mate-pair libraries, in which a non-negligible fraction of the read pairs get the wrong orientation and a much smaller insert size than what is expected. This contamination has been discussed before, in relation to integrated scaffolders, but solutions rely on the orientation being observable, e.g. by finding the junction adapter sequence in the reads. This is not always possible, making orientation and insert size of a read pair stochastic. To our knowledge, there is neither previous work on modeling PE-contamination, nor a study on the effect PE-contamination has on scaffolding quality. We have addressed PE-contamination in an update to our scaffolder BESST. We formulate the problem as an integer linear program which is solved using an efficient heuristic. The new method shows significant improvement over both integrated and stand-alone scaffolders in our experiments. The impact of modeling PE-contamination is quantified by comparing with the previous BESST model. We also show how other scaffolders are vulnerable to PE-contaminated libraries, resulting in an increased number of misassemblies, more conservative scaffolding and inflated assembly sizes. The model is implemented in BESST. Source code and usage instructions are found at https://github.com/ksahlin/BESST. BESST can also be downloaded using PyPI. Contact: ksahlin@kth.se. Supplementary data are available at Bioinformatics online.
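To make the idea of a stochastic insert size concrete, the following is a minimal sketch of a two-component Gaussian mixture, assuming the insert size of a pair can be measured directly (for example, when both reads map to the same contig). It is an illustration only and not the model implemented in BESST; the function names and all parameter values (contamination fraction, means, standard deviations) are hypothetical.

```python
import math


def log_normal_pdf(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution (sd must be positive)."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def pe_posterior(insert_size: float, pe_fraction: float,
                 pe_mean: float, pe_sd: float,
                 mp_mean: float, mp_sd: float) -> float:
    """Posterior probability that a read pair is a PE contaminant.

    Hypothetical two-component mixture: a short-insert PE component with
    weight pe_fraction and a long-insert MP component with weight
    1 - pe_fraction. Not the BESST model.
    """
    if not 0.0 < pe_fraction < 1.0:
        raise ValueError("pe_fraction must lie strictly between 0 and 1")
    log_pe = math.log(pe_fraction) + log_normal_pdf(insert_size, pe_mean, pe_sd)
    log_mp = math.log(1.0 - pe_fraction) + log_normal_pdf(insert_size, mp_mean, mp_sd)
    diff = log_mp - log_pe
    if diff > 700.0:
        # The MP component dominates so strongly that exp() would overflow.
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Example with made-up library parameters: 20% contamination,
# PE inserts around 300 bp, MP inserts around 3000 bp.
short_pair = pe_posterior(300, 0.2, 300, 50, 3000, 500)
long_pair = pe_posterior(3000, 0.2, 300, 50, 3000, 500)
```

Working in log space avoids numerical underflow when an observation lies far from one of the two components.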

Highlights

Genome assembly is still a challenging process, especially for large genomes, and scientists experiment with different combinations of data and tools to reduce errors, improve contiguity, and avoid ambiguity.
The presented work builds on the BESST scaffolder (Sahlin et al., 2014), which iterates over PE and/or MP libraries in the order of their mean insert size.
We define the problem of ordering and positioning contigs as an Integer Linear Program (ILP) and use heuristic permutation of the contig ordering to efficiently find an assignment with a good objective value.
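The last highlight can be illustrated with a toy example. The sketch below is not the ILP or the heuristic used in BESST; it assumes a simplified objective (the sum of absolute differences between the gap implied by an ordering and a hypothetical expected gap per link), ignores contig orientation, and uses adjacent swaps as the permutation move. All names and data are invented for illustration.

```python
from typing import Dict, List, Tuple

# (left contig, right contig, expected gap in bp); hypothetical link format.
Link = Tuple[str, str, float]


def ordering_cost(order: List[str], lengths: Dict[str, float],
                  links: List[Link]) -> float:
    """Sum of |implied gap - expected gap| over all links.

    Contigs are laid out left to right with no space between them.
    """
    start: Dict[str, float] = {}
    position = 0.0
    for name in order:
        start[name] = position
        position += lengths[name]
    cost = 0.0
    for left, right, expected_gap in links:
        implied_gap = start[right] - (start[left] + lengths[left])
        cost += abs(implied_gap - expected_gap)
    return cost


def improve_by_adjacent_swaps(order: List[str], lengths: Dict[str, float],
                              links: List[Link]) -> Tuple[List[str], float]:
    """Greedy local search: accept any adjacent swap that lowers the cost."""
    best = list(order)
    best_cost = ordering_cost(best, lengths, links)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            candidate = best[:]
            candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
            cost = ordering_cost(candidate, lengths, links)
            if cost < best_cost:
                best, best_cost = candidate, cost
                improved = True
    return best, best_cost


# Made-up example: three contigs and three links consistent with a, b, c.
lengths = {"a": 1000.0, "b": 500.0, "c": 2000.0}
links = [("a", "b", 0.0), ("b", "c", 0.0), ("a", "c", 500.0)]
order, cost = improve_by_adjacent_swaps(["c", "a", "b"], lengths, links)
```

The search stops when no adjacent swap lowers the cost, so it can end in a local optimum; it trades the optimality guarantee of an exact ILP solver for speed.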

Summary

Introduction

Genome assembly is still a challenging process, especially for large genomes, and scientists experiment with different combinations of data and tools to reduce errors, improve contiguity, and avoid ambiguity. An important step in the assembly process is scaffolding, in which contigs are ordered, oriented, and joined to form larger scaffold units. The input to a scaffolder is both large and noisy, and the data characteristics can vary considerably depending on the organism and assembler. Although contiguity and errors are the most important metrics for evaluating a scaffolder, we noted that scaffolders produce other artifacts that these metrics do not capture. We observed that assemblies could increase in size by up to 106% after scaffolding, and that this inflation mostly affects fragmented assemblies. A successful scaffolding will show some assembly inflation due to, e.g., unsequenced regions, but it should in general be very small.

Methods

Results

Conclusion

Journal: Bioinformatics	Publication Date: Mar 2, 2016
Citations: 42	License type: cc-by
