Abstract

Long-read, single-molecule DNA sequencing technologies have triggered a revolution in genomics by enabling the determination of large, reference-quality genomes in ways that overcome some of the limitations of short-read sequencing. However, the greater length and higher error rate of the reads generated on long-read platforms make the tools used for assembling short reads unsuitable for use in data assembly and motivate the development of new approaches. We present LoReTTA (Long Read Template-Targeted Assembler), a tool designed for performing de novo assembly of long reads generated from viral genomes on the PacBio platform. LoReTTA exploits a reference genome to guide the assembly process, an approach that has been successful with short reads. The tool was designed to deal with reads originating from viral genomes, which feature high genetic variability, possible multiple isoforms, and the dominant presence of additional organisms in clinical or environmental samples. LoReTTA was tested on a range of simulated and experimental datasets and outperformed established long-read assemblers in terms of assembly contiguity and accuracy. The software runs under the Linux operating system, is designed for easy adaptation to alternative systems, and features an automatic installation pipeline that takes care of the required dependencies. A command-line version and a user-friendly graphical interface version are available under a GPLv3 license at https://bioinformatics.cvr.ac.uk/software/ with the manual and a test dataset.

Highlights

  • DNA sequencing has prompted a period of explosive growth in the genomics of microorganisms

  • We present LoReTTA (Long Read Template-Targeted Assembler), a tool designed for performing de novo assembly of long reads generated from viral genomes on the PacBio platform

  • Focusing on long-read data generated on the PacBio platform, we found that established assemblers were not successful at reconstructing human cytomegalovirus (HCMV) genomes, for the reasons outlined above


Summary

Introduction

DNA sequencing has prompted a period of explosive growth in the genomics of microorganisms. Recombination between the inverted repeats in concatemeric genomes during DNA replication, followed by cleavage of unit-length genomes from concatemers, leads to the co-existence of equimolar amounts of four isoforms differing in the relative orientations of the unique long (UL) and unique short (US) regions (McVoy and Adler 1994) (Fig. 1). These structural features are largely invisible on the scale of short reads, but their representation in long reads can prematurely terminate assembly or introduce artefactual duplications. Although the software was originally designed to deal with HCMV, it proved successful at assembling a range of viral genomes.
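
The four isoforms described above can be illustrated with a minimal sketch. It is not part of LoReTTA; the segment layout (a long unique region and a short unique region, each flanked by a pair of inverted repeats), the function names, and the toy sequences are assumptions made purely for illustration. Because each unique region sits between a repeat and its reverse complement, inverting the whole flanked unit leaves the repeats in place and flips only the unique region, which gives two orientations of UL times two orientations of US.

```python
from itertools import product

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def isoforms(trl, ul, irs, us):
    """Build the four genome isoforms of a toy two-segment genome.

    Assumed (hypothetical) layout: TRL - UL - IRL - IRS - US - TRS,
    where IRL is the reverse complement of TRL and TRS is the
    reverse complement of IRS.
    """
    irl = revcomp(trl)
    trs = revcomp(irs)
    result = {}
    for flip_ul, flip_us in product((False, True), repeat=2):
        # Inverting a unit flanked by inverted repeats flips only its unique part.
        long_part = trl + (revcomp(ul) if flip_ul else ul) + irl
        short_part = irs + (revcomp(us) if flip_us else us) + trs
        name = "UL{}/US{}".format("-" if flip_ul else "+", "-" if flip_us else "+")
        result[name] = long_part + short_part
    return result


# Toy sequences, far shorter than any real viral genome.
toy = isoforms(trl="AAC", ul="GATTACAGG", irs="CCT", us="TTGCA")
for name, seq in toy.items():
    print(name, seq)
```

A long read that spans a junction between a repeat and a unique region will match only some of these isoforms, which is one way to picture why such reads can stall or confuse an assembler that expects a single linear arrangement.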

Read datasets
Benchmarking
Reference genome subsampling
Local de novo assembly
Genome reconstruction
Consensus calling
Effects of genome size
Effects of genome isoforms
Effects of datasets derived from simple mixtures
Effects of datasets derived from complex mixtures
Effects of the reference genome
Software evaluation on experimental datasets
Findings
Conclusion
