Abstract

BackgroundPaired-tag sequencing approaches are commonly used for the analysis of genome structure. However, mammalian genomes have a complex organization with a variety of repetitive elements that complicate comprehensive genome-wide analyses.ResultsHere, we systematically assessed the utility of paired-end and mate-pair (MP) next-generation sequencing libraries with insert sizes ranging from 170 bp to 25 kb, for genome coverage and for improving scaffolding of a mammalian genome (Rattus norvegicus). Despite a lower library complexity, large insert MP libraries (20 or 25 kb) provided very high physical genome coverage and were found to efficiently span repeat elements in the genome. Medium-sized (5, 8 or 15 kb) MP libraries were much more efficient for genome structure analysis than the more commonly used shorter insert paired-end and 3 kb MP libraries. Furthermore, the combination of medium- and large insert libraries resulted in a 3-fold increase in N50 in scaffolding processes. Finally, we show that our data can be used to evaluate and improve contig order and orientation in the current rat reference genome assembly.ConclusionsWe conclude that applying combinations of mate-pair libraries with insert sizes that match the distributions of repetitive elements improves contig scaffolding and can contribute to the finishing of draft genomes.

Highlights

  • Paired-tag sequencing approaches are commonly used for the analysis of genome structure

  • We generated seven different library insert sizes, including six libraries produced with this adapted MP protocol and one PE fragment library that was prepared in a separate experiment (Table 1)

  • Paired-end data with insert sizes up to 500 bp are commonly included in these processes, our results demonstrate that longer-range information as provided by the large insert MPs described here is essential for comprehensive genome assembly

Read more

Summary

Introduction

Paired-tag sequencing approaches are commonly used for the analysis of genome structure. Genome assemblies consist of kilobase- to megabase-sized contiguous sequences of DNA (contigs) that need to be positioned in a correct order and orientation This ordering of contigs (scaffolding) requires long-range structural information that reaches beyond the boundaries of contigs. Common repetitive elements [like LINE (L1) elements] in vertebrate genomes can span as much as 8 kb in size (Additional file 1) [7,14], illustrating the need for longer range information for comprehensive analysis of genome structures To this end, various bioinformatic algorithms like CREST [15] and ALLPATHS-LG [16] have been developed to increase effective PE read span by systematically merging overlapping sequences. A systematic assessment of the utility and combination of different library insert sizes for resolving existing assembly difficulties in complex regions of genomes is currently lacking

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call