The Evidential Statistics of Genetic Assembly: Bootstrapping a Reference Sequence

Yukihiko Toquenaga, Takuya Gagné

doi:10.3389/fevo.2021.614374

Abstract

Reference sequences play an essential role in genome assembly, much as type specimens do in taxonomy. Yet references are themselves samples, obtained at a particular time and location with a specific method. How can we evaluate and distinguish the uncertainties of the reference itself from those of the assembly methods? Here we bootstrapped 50 random read data sets from the small circular genome of an Escherichia coli bacteriophage, phiX174, and tried to reconstruct the reference with 14 free assembly programs. Nine of the 14 assembly programs were capable of circular genome reconstruction. Unicycler correctly reconstructed the reference for 44 of the 50 data sets; the contig it produced for each of the six failed data sets had minor defects. The other assembly programs reconstructed the reference with minor defects. The defect regions differed among the assembly programs, and the defect locations were far from randomly distributed in the reference genome. All Trinity contigs included one perfect copy of the reference in addition to an imperfect copy, whereas Minia contigs included two perfect copies. The centroid of the contigs for each assembly program except Unicycler differed from the reference by at most 75 bases. Nonmetric multidimensional scaling (NMDS) plots of the centroids indicated that even the reference sequence lay slightly off the estimated location of the true reference. We propose that combining bootstrapping of a reference, construction of consensus contigs as centroids under an edit distance, and NMDS plotting provides an evidential statistical approach to genetic assembly for non-fragmented base sequences.
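The bootstrapping step can be pictured with a minimal sketch that draws repeated random read sets from a circular sequence. The abstract does not state which read simulator, read length, coverage, or error model was used, so everything here is an assumption for illustration: the function names `sample_reads` and `bootstrap_read_sets` are hypothetical, the reads are error-free, and `toy_genome` is a short made-up string rather than the phiX174 sequence.

```python
import random


def sample_reads(genome, n_reads, read_len, rng):
    """Draw error-free reads from a circular genome at uniformly random starts.

    Assumption: uniform start positions and no sequencing errors; a real
    read simulator would model both differently.
    """
    size = len(genome)
    if read_len > size:
        raise ValueError("read_len must not exceed the genome length")
    doubled = genome + genome  # lets a read wrap around the origin
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(size)
        reads.append(doubled[start:start + read_len])
    return reads


def bootstrap_read_sets(genome, n_sets, n_reads, read_len, seed=0):
    """Generate n_sets independent read data sets from the same genome."""
    rng = random.Random(seed)
    return [sample_reads(genome, n_reads, read_len, rng) for _ in range(n_sets)]


# Hypothetical toy sequence standing in for a small circular genome.
toy_genome = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
read_sets = bootstrap_read_sets(toy_genome, n_sets=50, n_reads=20, read_len=8)
print(len(read_sets), "read sets; first read of set 0:", read_sets[0][0])
```

Each read set would then be handed to an assembly program, giving one contig (or set of contigs) per data set and per program.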

Highlights

Assume that you have obtained multiple non-fragmented base sequences assembled from data generated with next-generation sequencing (NGS) or more advanced methods.
We propose that bootstrapping can evaluate and characterize assembly programs by using distance measurements, which play essential roles in evidential statistics (Lindsay, 2004).
Most contigs generated with assembly software were longer than the reference sequence.

Summary

Introduction

Assume that you have obtained multiple non-fragmented base sequences assembled from data generated with next-generation sequencing (NGS) or more advanced methods. We further assume that no reference sequences are available for the material. We propose an evidential statistical method for inferring the true sequence by bootstrapping and Nonmetric Multidimensional Scaling (NMDS) plotting of the assembled non-fragmented sequences. We never seek the true model for a specific data set. The true model for a specific data set still plays an essential role in biology that uses base sequence data. DNA or RNA sequencing is rather conservative. It relies on reference sequences often obtained with decades-old sequencing techniques (e.g., Sung, 2017). Those that resemble the reference are promising candidates. The reference sequences play the role of type specimens in taxonomical identification (Ballouz et al., 2019).
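To make the distance-based part of the proposal concrete, the sketch below computes pairwise edit (Levenshtein) distances among a few assembled sequences and picks the one closest to all others. This is only an illustration under stated assumptions: the paper describes consensus contigs as centroids in an edit distance, whereas this sketch substitutes a medoid (the existing sequence with the smallest total distance) as a simpler stand-in. It also ignores the rotation of circular contigs, and the function names and toy contigs are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise edit distances."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j])
    return d


def medoid_index(d):
    """Index of the sequence with the smallest summed distance to the rest.

    Assumption: a medoid stands in for the consensus centroid here.
    """
    totals = [sum(row) for row in d]
    return totals.index(min(totals))


# Hypothetical toy contigs, as if from bootstrap data sets of one program.
contigs = ["ACGTACGT", "ACGTACGA", "ACGTTCGT", "ACGGACGT"]
d = distance_matrix(contigs)
center = contigs[medoid_index(d)]
print("medoid:", center)
```

The same distance matrix is the kind of input an NMDS routine takes to place sequences in a low-dimensional plot, which is where centroids from different assembly programs and the reference could be compared.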



Journal: Frontiers in Ecology and Evolution
Publication Date: Jul 26, 2021
Citations: 1
License type: CC BY 4.0


