Abstract

Background: Efficient and effective genome scaffolding tools are still in high demand for generating reference-quality assemblies. While long read data alone is unlikely to produce a chromosome-scale assembly for most eukaryotic species, inexpensive Hi-C sequencing technology, which captures the chromosomal profile of a genome, is now widely used to complete the task. However, existing Hi-C based scaffolding tools either require the chromosome number a priori as input, or lack the ability to build highly continuous scaffolds.

Results: We design and develop a novel Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively, based on the N-best neighbors of contigs. After scaffolding, it identifies potential misjoins and breaks them to preserve scaffolding accuracy. Through tests on three long read based de novo assemblies from three different species, we demonstrate that pin_hic is more efficient than current state-of-the-art tools and can generate much more continuous scaffolds, while achieving higher or comparable accuracy.

Conclusions: Pin_hic is an efficient Hi-C based scaffolding tool that can be useful for building chromosome-scale assemblies. As many sequencing projects have been launched in recent years, we believe pin_hic has the potential to be applied in these projects and to make a meaningful contribution.

Highlights

  • Efficient and effective genome scaffolding tools are still in high demand for generating reference-quality assemblies

  • We proposed a new Hi-C scaffolding method for generating chromosome-scale scaffolds through iterative weighted linking. It uses an N-best neighbor strategy to resolve the non-reciprocal best neighbor issue and exploit all possible links, together with a robust method for discovering misjoins in scaffolds. The misjoin check improves scaffolding accuracy by comparing the maximum physical coverage of each join with that of its neighboring contigs, and it is theoretically and practically unaffected by scaffold length

  • We defined a novel “SAT” format to store a scaffolding graph, which can be used in further genomic analysis, such as manual curation
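
The N-best neighbor idea in the highlights can be illustrated with a small sketch. This is not pin_hic's implementation: the function names, the dictionary of link weights, and the rule that a link is kept only when each contig appears among the other's N best neighbors are all assumptions made for illustration. It shows why restricting to a single best neighbor (N = 1) can discard useful links when the best-neighbor relation is not reciprocal.

```python
from collections import defaultdict


def n_best_neighbors(link_weights, n=2):
    """Return, for each contig, its n highest-weighted neighbors.

    link_weights is a hypothetical dict mapping an unordered contig pair
    (a, b) to a normalized Hi-C link weight.
    """
    neighbors = defaultdict(list)
    for (a, b), weight in link_weights.items():
        neighbors[a].append((weight, b))
        neighbors[b].append((weight, a))
    best = {}
    for contig, candidates in neighbors.items():
        # Highest weight first; ties broken by contig name for determinism.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        best[contig] = [name for _, name in candidates[:n]]
    return best


def candidate_edges(link_weights, n=2):
    """Keep a link only if each contig is among the other's n best neighbors.

    This mutual-membership rule is an assumption of this sketch.
    """
    best = n_best_neighbors(link_weights, n)
    edges = [
        (a, b, weight)
        for (a, b), weight in link_weights.items()
        if b in best[a] and a in best[b]
    ]
    edges.sort(key=lambda e: -e[2])  # strongest links first
    return edges


# Toy weights: A's best neighbor is B, but B's best neighbor is C.
weights = {("A", "B"): 10, ("B", "C"): 12, ("C", "D"): 20, ("A", "C"): 1}
reciprocal_only = candidate_edges(weights, n=1)
two_best = candidate_edges(weights, n=2)
```

With these toy weights, `n=1` keeps only the C–D link, because A–B and B–C are not reciprocal best pairs. With `n=2`, the A–B and B–C links are recovered as well, while the weak A–C link is still rejected.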

Summary

Results

To assess the performance of pin_hic, we conducted three experiments on three different VGP assemblies and compared our results with the state-of-the-art scaffolding tools SALSA2 and 3D-DNA.

Scaffolding results evaluation. The scaffolding results are shown, with the best results highlighted in bold. In the experiments, both SALSA2 and 3D-DNA were run with default settings, without error correction before scaffolding. Pin_hic was run with default settings, which use three iterations, summation normalization and the three-part split method. We assessed accuracy using QUAST-LG [20] with the chromosome-scale assemblies mentioned in the last section.

In these figures, scaffolds comprising 90% of the reference genome size are selected and mapped to the VGP fAnaTes1.2 assembly; the chromosomes of the assembly are displayed on the left and the scaffolds on the right, and the interrupting ribbons are the visible mis-assemblies. To balance scaffolding efficiency, correctness and continuity, pin_hic uses the three-part split and summation normalization methods as its default mode. Pin_hic iteration performance for the other two scaffolds is shown in Additional file 1: Fig. S6.
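
The selection of scaffolds covering 90% of the reference genome size, used for the figures described above, can be sketched as follows. The source does not state the order in which scaffolds are chosen; this sketch assumes they are taken from longest to shortest until the cumulative length reaches the target. The function name and the toy lengths are hypothetical.

```python
def scaffolds_covering(lengths, reference_size, fraction=0.9):
    """Pick scaffolds, longest first, until they cover `fraction` of the reference.

    lengths is a hypothetical dict of scaffold name -> length in bases.
    Longest-first ordering is an assumption of this sketch.
    """
    target = fraction * reference_size
    selected, total = [], 0
    ordered = sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, length in ordered:
        if total >= target:
            break
        selected.append(name)
        total += length
    return selected


# Toy example: a 1000-base reference and four scaffolds.
toy_lengths = {"s1": 500, "s2": 300, "s3": 150, "s4": 50}
chosen = scaffolds_covering(toy_lengths, reference_size=1000)
```

For the toy input, the target is 900 bases, so `s1`, `s2` and `s3` (950 bases in total) are chosen and `s4` is left out.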
