Abstract

Background

With the rapid development of accurate sequencing and assembly technologies, an increasing number of high-quality chromosome-level and haplotype-resolved genome assemblies have become available, creating great opportunities for computational pangenomics. Although genome graphs are among the most useful models for pangenome representation, their structural complexity makes it difficult to present genome information as intuitively as a linear reference genome does. Thus, efficiently and accurately analyzing the spatial structure of a genome graph and coordinating its information remains a substantial challenge.

Results

We developed a new method, the colored superbubble (cSupB), that can overcome the complexity of graphs and organize a set of species- or population-specific haplotype sequences of interest. Based on this model, we propose a tri-tuple coordinate system that combines an offset value, topological structure and sample information. Additionally, cSupB provides a novel method that uses complete topological information to efficiently detect small indels (< 50 bp) in highly similar samples, which can be validated with simulated datasets. Moreover, we demonstrated that cSupB can adapt to complex cycle structures.

Conclusions

The solution is made suitable for increasingly complex genome graphs by relaxing the directed-acyclic-graph constraint, and the cSupB motif and the cSupB method can be extended to any colored directed acyclic graph. We anticipate that our method will facilitate the analysis of individual haplotype variants and population genomic diversity. We have developed a C++ program implementing our method, available at https://github.com/eggleader/cSupB.

Highlights

  • Within a certain species, individual genomes vary in both the gene content and genomic portions of DNA sequences

  • We developed an efficient algorithm to identify each specific graph spatial structure, called a colored superbubble, and organized these cSupBs into a tree that accurately reflects their inclusion relationships depicted in the colored de Bruijn graph we constructed

  • The results show that, on the one hand, the tri-tuple coordinate system can accommodate the existing linear reference coordinates and annotation data; on the other hand, the variants detected from the cSupBs are more comprehensive and diverse
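The two ideas in the highlights above, a tree of cSupBs ordered by inclusion and a tri-tuple coordinate of offset, topological structure and sample, can be pictured with a minimal C++ sketch. Everything in it is an assumption made for illustration: the names `TriTupleCoord`, `BubbleNode`, `enclosingChain` and `nestingDepth`, the field types, and the parent-index tree encoding are hypothetical and are not taken from the cSupB program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical tri-tuple coordinate. Field names and types are assumptions
// for illustration, not the layout used by the cSupB program.
struct TriTupleCoord {
    std::uint64_t offset;  // offset value inside the enclosing structure
    std::size_t bubble;    // topological structure: index of a cSupB in the tree
    std::string sample;    // sample (color) the position belongs to
};

// Hypothetical node of the cSupB inclusion tree, stored as a parent index.
struct BubbleNode {
    int parent;  // index of the directly enclosing cSupB, or -1 for a root
};

// Returns the cSupB indices from `bubble` outward to its root, i.e. every
// structure that contains the given one, innermost first.
std::vector<std::size_t> enclosingChain(const std::vector<BubbleNode>& tree,
                                        std::size_t bubble) {
    assert(bubble < tree.size());
    std::vector<std::size_t> chain;
    int cur = static_cast<int>(bubble);
    while (cur != -1) {
        chain.push_back(static_cast<std::size_t>(cur));
        cur = tree[static_cast<std::size_t>(cur)].parent;
    }
    return chain;
}

// Number of cSupBs that strictly contain the coordinate's cSupB.
std::size_t nestingDepth(const std::vector<BubbleNode>& tree,
                         const TriTupleCoord& coord) {
    return enclosingChain(tree, coord.bubble).size() - 1;
}
```

Under these assumptions, a coordinate such as `TriTupleCoord{42, 2, "sampleA"}` names a position by its offset, the cSupB it falls in, and the sample it belongs to, while `enclosingChain` recovers the inclusion relationships that the tree is meant to reflect.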


Summary

Introduction

Guo et al. BMC Bioinformatics (2021) 22:282

Within a certain species, individual genomes vary in both the gene content and genomic portions of DNA sequences. Because of the rapid development of accurate long-read sequencing and assembly technologies [1, 2], abundant high-quality chromosome- and haplotype-resolved assemblies of species- or population-specific genomes have been derived for many species, thereby accelerating the arrival of the population genome era [3]. A pangenome provides a complete picture of the genomes and complex genomic variants within a species of interest, and it creates an opportunity for the development of efficient computational methods and various promising applications in medical biology [6], ecology [7], and evolutionary biology [8]. With the rapid development of accurate sequencing and assembly technologies, an increasing number of high-quality chromosome-level and haplotype-resolved genome assemblies have become available, creating great opportunities for computational pangenomics. Efficiently and accurately analyzing the spatial structure of a genome graph and coordinating its information remains a substantial challenge.
