Abstract

Next-generation sequencing (NGS) offers a powerful opportunity to identify low-abundance, intra-host viral sequence variants, yet the focus of many bioinformatic tools on consensus sequence construction has precluded a thorough analysis of intra-host diversity. To take full advantage of the resolution of NGS data, we developed HAplotype PHylodynamics PIPEline (HAPHPIPE), an open-source tool for the de novo and reference-based assembly of viral NGS data, with both consensus sequence assembly and a focus on the quantification of intra-host variation through haplotype reconstruction. We validate and compare the consensus sequence assembly methods of HAPHPIPE to those of two alternative software packages, HyDRA and Geneious, using simulated HIV and empirical HIV, HCV, and SARS-CoV-2 datasets. Our validation methods included read mapping, genetic distance, and genetic diversity metrics. In simulated NGS data, HAPHPIPE generated pol consensus sequences significantly closer to the true consensus sequence than those produced by HyDRA and Geneious and performed comparably to Geneious for HIV gp120 sequences. Furthermore, using empirical data from multiple viruses, we demonstrate that HAPHPIPE can analyze larger sequence datasets due to its greater computational speed. Therefore, we contend that HAPHPIPE provides a more user-friendly platform for users with and without bioinformatics experience to implement current best practices for viral NGS assembly than other currently available options.

Highlights

  • Next-generation sequence (NGS) data provide a new opportunity to more efficiently study viral diversity, especially within-host sequence variation, which is key to understanding the evolutionary dynamics of viral populations both within and amongst hosts

  • We found that using HAPHPIPE improves viral NGS analysis in conserved regions

  • We demonstrated that de novo assembly performs better than reference-based assembly at generating a consensus sequence that is closer to the true sequence

Summary

Introduction

Next-generation sequence (NGS) data provide a new opportunity to more efficiently study viral diversity, especially within-host sequence variation, which is key to understanding the evolutionary dynamics of viral populations both within and amongst hosts. NGS makes it possible to better explore viral sequence evolution over time [1] and evolution among hosts, including the direction of cross-species transmission [2], and to elucidate the origin of viral epidemics [3]. While some studies capitalize on the ability of NGS data to identify intra-host sequence variants, the majority rely on consensus sequence estimation. This reliance results in a loss of resolution in intra-patient viral diversity.

Viruses 2020, 12, 758; doi:10.3390/v12070758 www.mdpi.com/journal/viruses

For reference-based assembly, sequencing reads are aligned (or mapped) to a reference sequence and a consensus sequence is generated, often using majority rule, where the most frequently encountered nucleotide at each aligned position is chosen as the consensus nucleotide at that position.
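The majority-rule step described above can be illustrated with a short Python sketch. This is not HAPHPIPE's implementation: the function name `majority_consensus` is hypothetical, the reads are assumed to be already aligned to the same coordinates and padded to equal length with `-`, and the choices to ignore gaps and to report ties or uncovered positions as `N` are assumptions made for illustration only.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-", ambiguous="N"):
    """Build a majority-rule consensus from equal-length aligned reads.

    Assumptions (illustrative, not taken from any specific tool):
    gaps are ignored when counting, and a tie or an all-gap column
    is reported as the ambiguous symbol.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    # zip(*reads) walks the alignment column by column.
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if not counts:
            # No read covers this position.
            consensus.append(ambiguous)
            continue
        ranked = counts.most_common(2)
        top_base, top_count = ranked[0]
        # A tie between the two most frequent bases has no majority.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            consensus.append(ambiguous)
        else:
            consensus.append(top_base)
    return "".join(consensus)


if __name__ == "__main__":
    reads = ["ACGT-", "ACGTA", "ATGTA", "-CGAA"]
    print(majority_consensus(reads))  # ACGTA
```

Note that the minority bases at the second and fourth positions (`T` and `A`) vanish from the output, which is the loss of intra-host resolution that consensus-only analysis entails.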
