Abstract

Next generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information, these data can be used to study their epidemiology, evolution, transmission patterns, and can also inform drug and vaccine design. Viral genomes, however, represent a great challenge to bioinformatics due to their high mutation rate and forming quasispecies in the same infected host, bringing about the need to implement advanced bioinformatics tools to assemble consensus genomes well-representative of the viral population circulating in individual patients. Many tools have been developed to preprocess sequencing reads, carry-out de novo or reference-assisted assembly of viral genomes and assess the quality of the genomes obtained. Most of these tools however exist as standalone workflows and usually require huge computational resources. Here we present (Viral Genomes Easily Analyzed), a Snakemake workflow for analyzing RNA viral genomes. VGEA enables users to map sequencing reads to the human genome to remove human contaminants, split bam files into forward and reverse reads, carry out de novo assembly of forward and reverse reads to generate contigs, pre-process reads for quality and contamination, map reads to a reference tailored to the sample using corrected contigs supplemented by the user’s choice of reference sequences and evaluate/compare genome assemblies. We designed a project with the aim of creating a flexible, easy-to-use and all-in-one pipeline from existing/stand-alone bioinformatics tools for viral genome analysis that can be deployed on a personal computer. VGEA was built on the Snakemake workflow management system and utilizes existing tools for each step: fastp (Chen et al., 2018) for read trimming and read-level quality control, BWA (Li & Durbin, 2009) for mapping sequencing reads to the human reference genome, SAMtools (Li et al., 2009) for extracting unmapped reads and also for splitting bam files into fastq files, IVA (Hunt et al., 2015) for de novo assembly to generate contigs, shiver (Wymant et al., 2018) to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit (Shen et al., 2016) for cleaning shiver assembly for QUAST, QUAST (Gurevich et al., 2013) to evaluate/assess the quality of genome assemblies and MultiQC (Ewels et al., 2016) for aggregation of the results from fastp, BWA and QUAST. Our pipeline was successfully tested and validated with SARS-CoV-2 (n = 20), HIV-1 (n = 20) and Lassa Virus (n = 20) datasets all of which have been made publicly available. VGEA is freely available on GitHub at: https://github.com/pauloluniyi/VGEA under the GNU General Public License.

Highlights

  • The most abundant biological entities on Earth are viruses as they can be found among all cellular forms of life

  • The dependencies are fastp for read trimming and read-level quality control, Burrows-Wheeler Alignment tool (BWA) for mapping sequencing reads to the human reference genome, SAMtools for extracting unmapped reads and for splitting bam files into fastq files, IVA for de novo assembly to generate contigs, shiver to pre-process reads for quality and contamination, map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit for cleaning shiver assembly for QUAST, QUAST to evaluate/assess the quality of genome assemblies and MultiQC for aggregation of the results from fastp, BWA and QUAST

  • VGEA maps reads to the human reference genome in order to remove human contaminants, the pipeline carries out this step using BWA

Read more

Summary

Introduction

The most abundant biological entities on Earth are viruses as they can be found among all cellular forms of life. In recent times (past two decades), a number of these viruses have emerged in the human population causing disease outbreaks and sometimes pandemics These viruses include mainly: Influenza virus, Severe Acute Respiratory Syndrome (SARS) coronavirus, Middle East Respiratory Syndrome (MERS) coronavirus, Ebola virus, Yellow fever virus, Lassa virus (LASV), Zika virus (Chan, 2002; Bean et al, 2013; Folarin et al, 2016; Grubaugh et al, 2017; Metsky et al, 2017; Siddle et al, 2018; Ajogbasile et al, 2020) and SARS-CoV-2 (Chen et al, 2020; Holshue et al, 2020; Sohrabi et al, 2020). The field of genomics, especially pathogen genomics has been transformed by NGS, with costs constantly decreasing, equipment becoming more portable/field deployable during outbreaks and remarkable increase in data availability

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call