Abstract

The reference sequences play an essential role in genome assembly, like type specimens in taxonomy. Those references are also samples obtained at some time and location with a specific method. How can we evaluate or discriminate uncertainties of the reference itself and assembly methods? Here we bootstrapped 50 random read data sets from a small circular genome of aEscherichia colibacteriophage, phiX174, and tried to reconstruct the reference with 14 free assembly programs. Nine out of 14 assembly programs were capable of circular genome reconstruction. Unicycler correctly reconstructed the reference for 44 out of 50 data sets, but each reconstructed contig of the failed six data sets had minor defects. The other assembly software could reconstruct the reference with minor defects. The defect regions differed among the assembly programs, and the defect locations were far from randomly distributed in the reference genome. All contigs of Trinity included one, but Minia had two perfect copies other than an imperfect reference copy. The centroid of contigs for assembly programs except Unicycler differed from the reference with 75bases at most. Nonmetric multidimensional scaling (NMDS) plots of the centroids indicated that even the reference sequence was located slightly off from the estimated location of the true reference. We propose that the combination of bootstrapping a reference, making consensus contigs as centroids in an edit distance, and NMDS plotting will provide an evidential statistic way of genetic assembly for non-fragmented base sequences.

Highlights

  • Assume that you got multiple non-fragmented base sequences assembled from data generated with next-generation sequencing (NGS) or more advanced methods

  • We propose that bootstrapping can evaluate and characterize assembly programs by using distance measurements that play essential roles in the evidential statistics (Lindsay, 2004)

  • Most contigs generated with assembly software were longer than the reference sequence

Read more

Summary

Introduction

Assume that you got multiple non-fragmented base sequences assembled from data generated with next-generation sequencing (NGS) or more advanced methods. We further assume that we do not have available reference sequences for the material. We propose an evidential statistical method for inferring true sequence by bootstrapping and Nonmetric Multidimensional Scaling (NMDS) plotting with the assembled non-fragmented sequences. We never seek the true model for a specific data set. The true model for a specific data set still plays an essential role in biology using base sequence data. DNA or RNA sequencing is rather conservative. It relies on reference sequences often obtained with decades-old sequence techniques (e.g., Sung, 2017). Those that resemble the reference are promising candidates. The reference sequences play the role of type specimens in taxonomical identification (Ballouz et al, 2019)

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.