Abstract

Next-generation sequencing technologies provide an unparalleled opportunity for the characterization and discovery of known and novel viruses. Because viruses are known to have the highest mutation rates when compared to eukaryotic and bacterial organisms, we assess the extent to which eleven well-known alignment algorithms (BLAST, BLAT, BWA, BWA-SW, BWA-MEM, BFAST, Bowtie2, Novoalign, GSNAP, SHRiMP2 and STAR) can be used for characterizing mutated and non-mutated viral sequences, including those that exhibit RNA splicing, in transcriptome samples. To evaluate aligners objectively, we developed a realistic RNA-Seq simulation and evaluation framework (RiSER) and propose a new combined score to rank aligners for viral characterization in terms of their precision, sensitivity and alignment accuracy. We used RiSER to simulate both human and viral read sequences and suggest the best set of aligners for viral sequence characterization in human transcriptome samples. Our results show that significant and substantial differences exist between aligners and that a digital-subtraction-based viral identification framework can and should use different aligners for different parts of the process. We determine the extent to which mutated viral sequences can be effectively characterized and show that more sensitive aligners such as BLAST, BFAST, SHRiMP2, BWA-SW and GSNAP can accurately characterize substantially divergent viral sequences with overall sequence mutation rates of up to 15%. We believe that the results presented here will be useful to researchers choosing aligners for viral sequence characterization using next-generation sequencing data.

Highlights

  • Emerging and re-emerging infectious diseases in the past three decades have created a significant cause of concern worldwide and exerted a significant burden on public health

  • Even if the assumption of a uniform distribution of expression levels of viral transcripts in our model is very approximate, our model suggests that the ranking of aligners remains virtually unchanged as k varies between 0 and 1, except for Novoalign, which scores significantly lower for mutation rates <2% when k = 1

  • The 61076 reads simulated from host genomic regions not represented in the human were aligned to the hg19 reference genome as described in section 5 of the Materials S1. doi:10.1371/journal.pone.0076935.t009

Introduction

Emerging and re-emerging infectious diseases have become a significant cause of concern worldwide over the past three decades and have placed a significant burden on public health. In the past decade alone, we have seen epidemics of virus variants such as the avian influenza H5N1 and the swine flu H1N1 that still pose a significant threat to public health [1]. Some infectious agents such as viruses have been found to be etiological agents of human cancer, causing 15% to 20% of all human tumors worldwide [2]. Despite significant progress in the fight against infectious diseases, there is clearly a pressing need for fast and accurate methods for the discovery and identification of viral etiological agents.
