Abstract

The perpetually increasing rate at which viral full-genome sequences are being determined is creating a pressing demand for computational tools that will aid the objective classification of these genome sequences. Taxonomic classification approaches that are based on pairwise genetic identity measures are potentially highly automatable and are progressively gaining favour with the International Committee on Taxonomy of Viruses (ICTV). There are, however, various issues with the calculation of such measures that could potentially undermine the accuracy and consistency with which they can be applied to virus classification. Firstly, pairwise sequence identities computed based on multiple sequence alignments rather than on multiple independent pairwise alignments can lead to the deflation of identity scores with increasing dataset sizes. Secondly, when gap characters need to be introduced during sequence alignments to account for insertions and deletions, methodological variations in the way that these characters are introduced and handled during pairwise genetic identity calculations can cause high degrees of inconsistency in the way that different methods classify the same sets of sequences. Here we present Sequence Demarcation Tool (SDT), a free user-friendly computer program that aims to provide a robust and highly reproducible means of objectively using pairwise genetic identity calculations to classify any set of nucleotide or amino acid sequences. SDT can produce publication-quality pairwise identity plots and colour-coded distance matrices to further aid the classification of sequences according to ICTV-approved taxonomic demarcation criteria. Besides a graphical interface version of the program for Windows computers, command-line versions of the program are available for a variety of different operating systems (including a parallel version for cluster computing platforms).

Highlights

  • The ever advancing rate at which novel viral genomes are being determined is creating a serious challenge both for taxonomists seeking to ensure the consistent and accurate classification of these genomes, and for laboratory virologists attempting to accurately name newly determined genome sequences prior to deposition into public sequence databases

  • Given that in many cases the only taxonomically useful information that is available for a particular genome sequence is the sequence data itself, the use of pairwise nucleotide sequence identity measures is becoming increasingly popular as a means of objectively classifying bacteria [1] and viruses [2,3] into consistent and practically useful operational taxonomic units (OTUs) such as variants, strains, species or genera

  • Sequence identity calculation: Given an input FASTA file, Sequence Demarcation Tool (SDT) aligns every unique pair of sequences (S sequences yield [S×(S-1)]/2 alignments) using the Needleman-Wunsch (NW) algorithms implemented in MUSCLE [4], ClustalW [5] or MAFFT [6], and computes the identity score for each pair of sequences as 1-M/N, where M is the number of mismatched nucleotides and N is the total number of columns along the alignment where neither sequence has a gap character
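The identity score described in the last highlight can be sketched in a few lines of Python. This is a minimal illustration of the 1-M/N formula and the pair count only, not SDT's own code: it assumes the two sequences have already been aligned (by MUSCLE, ClustalW, MAFFT or similar) and that gaps are written as "-". The function names `pairwise_identity` and `count_pairs` are hypothetical.

```python
GAP = "-"  # assumed gap character in the aligned sequences


def pairwise_identity(aligned_a: str, aligned_b: str) -> float:
    """Return 1 - M/N for two already-aligned sequences.

    N counts columns where neither sequence has a gap;
    M counts the mismatches among those N columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    n_columns = 0   # N: columns without a gap in either sequence
    mismatches = 0  # M: mismatched characters among those columns
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == GAP or y == GAP:
            continue  # gapped columns are excluded from N
        n_columns += 1
        if x != y:
            mismatches += 1

    if n_columns == 0:
        raise ValueError("no ungapped columns to compare")
    return 1 - mismatches / n_columns


def count_pairs(s: int) -> int:
    """Number of unique pairwise alignments for s sequences: s*(s-1)/2."""
    return s * (s - 1) // 2


if __name__ == "__main__":
    # Five ungapped columns, one mismatch (G vs C).
    print(pairwise_identity("ACGT-A", "ACCTTA"))
    # Pairwise alignments needed for 100 input sequences.
    print(count_pairs(100))
```

Because the number of alignments grows quadratically with the number of sequences, a dataset of 100 sequences already requires 4950 pairwise alignments, which is presumably why a parallel version for cluster computing platforms is offered.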



Introduction

The ever advancing rate at which novel viral genomes are being determined is creating a serious challenge both for taxonomists seeking to ensure the consistent and accurate classification of these genomes, and for laboratory virologists attempting to accurately name newly determined genome sequences prior to deposition into public sequence databases. Given a new virus sequence and the desire to classify it based on an established ICTV-approved species demarcation threshold, there are many different ways in which a researcher might determine whether or not it should be included within an already established species. Computer programs such as MUSCLE [4], ClustalW [5], MAFFT [6] or BLAST [7] could be used to make either multiple individual pairwise sequence alignments or a single multiple sequence alignment, and other programs such as MEGA5 [8], PHYLIP [9], PAUP [10] or GENEIOUS (http://www.geneious.com/) could be used to calculate a variety of different pairwise identity scores. For a given pair of sequences, different combinations of alignment and pairwise identity calculation approaches will in many cases yield a fairly broad range of possible sequence identity scores.
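To make the last point concrete, the sketch below scores one fixed pairwise alignment under two gap-handling conventions. Both conventions are illustrative assumptions chosen for contrast and are not attributed to any of the programs named above; the function name `identity_variants` and the "-" gap character are likewise hypothetical.

```python
GAP = "-"  # assumed gap character


def identity_variants(aligned_a: str, aligned_b: str) -> dict:
    """Score one alignment under two illustrative gap conventions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = mismatches = gap_columns = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == GAP and y == GAP:
            continue  # column gapped in both sequences carries no signal
        if x == GAP or y == GAP:
            gap_columns += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1

    ungapped = matches + mismatches
    if ungapped == 0:
        raise ValueError("no ungapped columns to compare")

    return {
        # Convention 1: gapped columns are left out of the denominator.
        "gaps_ignored": matches / ungapped,
        # Convention 2: every gapped column counts as a mismatch.
        "gaps_as_mismatches": matches / (ungapped + gap_columns),
    }


if __name__ == "__main__":
    for name, score in identity_variants("ACGT-A", "ACCTTA").items():
        print(f"{name}: {score:.3f}")
```

For this toy alignment the two conventions give 0.800 and 0.667 respectively, so the same pair of sequences could fall on either side of a hypothetical 75% demarcation threshold depending only on how gaps are counted.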
