Abstract

Background: Thanks to recent advances in next-generation sequencing (NGS) technologies, a large amount of genomic data, in the form of short DNA sequences known as reads, has been accumulating. Diverse assemblers have been developed to generate high-quality de novo assemblies from NGS reads, but their outputs differ considerably because of algorithmic differences. However, there are no properly structured measures for showing the similarity or difference between assemblies.

Results: We developed a new measure, called the GMASS score, for comparing two genome assemblies in terms of their structure. The GMASS score is based on the distribution pattern of the number and coverage of similar regions between a pair of assemblies. When evaluated on simulated assembly datasets, the new measure was able to show structural similarity between assemblies. Applying the GMASS score to assemblies in recently published benchmark datasets showed the divergent performance of current assemblers, as well as the ability of the score to compare assemblies.

Conclusion: The GMASS score is a novel measure for representing structural similarity between two assemblies. It will contribute to the understanding of assembly output and to the development of de novo assemblers.

Highlights

  • Thanks to recent advances in next-generation sequencing (NGS) technologies, a large amount of genomic data, in the form of short DNA sequences known as reads, has been accumulating

  • Starting from an ancestral form of an assembly, descendant assemblies in the dataset were simulated with different degrees of evolutionary divergence, which determined the amount of perturbation in the assembly simulation process

  • The GMASS score is a novel measure for representing structural similarity between two assemblies

Summary

Introduction

Kwon et al. BMC Bioinformatics (2019) 20:147

Thanks to recent advances in next-generation sequencing (NGS) technologies, a large amount of genomic data, in the form of short DNA sequences known as reads, has been accumulating. Several algorithms have been developed to generate high-quality de novo assemblies. They are mainly classified into three categories: the greedy graph-based algorithm, the Overlap-Layout-Consensus-based algorithm, and the de Bruijn graph-based algorithm [4]. The overlap scores are calculated using the number of matching bases in the overlap. Both the Overlap-Layout-Consensus-based algorithm and the de Bruijn graph-based algorithm rely on a graph structure constructed from the NGS reads. Overlap-Layout-Consensus-based assemblers, such as CABOG [8], Newbler [9] and Celera [10], construct an overlap graph using the direct overlaps among the NGS reads, whereas de Bruijn graph-based assemblers, such as ABySS [11], SOAPdenovo [12], ALLPATHS-LG [13] and Velvet [14], are based on the overlaps of all possible subsequences of length k, known as k-mers, extracted from the NGS reads.
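To make the contrast between the two graph structures concrete, the following is a minimal Python sketch, not the implementation used by any of the assemblers named above. It assumes error-free reads and exact suffix-prefix matching, so the overlap score is simply the length of the matching overlap. The function names (`overlap_length`, `overlap_graph`, `de_bruijn_graph`), the `min_len` parameter and the toy reads are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Assumption: exact matching only; real assemblers tolerate mismatches.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Overlap graph: a directed edge (a, b) is scored by the overlap length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap_length(a, b, min_len)
                if n:
                    edges[(a, b)] = n
    return edges


def de_bruijn_graph(reads, k):
    """de Bruijn graph: nodes are (k-1)-mers, and every k-mer in a read
    adds one edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy reads, invented for this example.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(overlap_graph(reads))
print(de_bruijn_graph(reads, k=4))
```

In the overlap graph, the reads themselves are the nodes and all pairs of reads are compared directly. In the de Bruijn graph, the reads are first broken into k-mers, so the graph size depends on the number of distinct (k-1)-mers rather than on the number of reads.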
