Abstract

Phages play critical roles in the survival and pathogenicity of their hosts, via lysogenic conversion factors, and in nutrient redistribution, via cell lysis. Analyses of phage- and viral-encoded genes in environmental samples provide insights into the physiological impact of viruses on microbial communities and human health. However, phage ORFs are extremely diverse, and over 70% of them are dissimilar to any genes with annotated functions in GenBank. Better identification of viruses would also aid in the detection and diagnosis of disease, in vaccine development, and more generally in understanding the physiological potential of any environment. In contrast to enzymes, viral structural protein function can be much more challenging to detect from sequence data because of low sequence conservation, few known conserved catalytic sites or sequence domains, and relatively limited experimental data. We have designed a method of predicting phage structural protein sequences that uses Artificial Neural Networks (ANNs). First, we trained ANNs to classify viral structural proteins using amino acid frequency; these correctly classify a large fraction of test cases with a high degree of specificity and sensitivity. Subsequently, we added estimates of protein isoelectric points as a feature to ANNs that classify specialized families of proteins, namely major capsid and tail proteins. As expected, these more specialized ANNs are more accurate than the structural ANNs. To experimentally validate the ANN predictions, several ORFs that are ANN-predicted structural proteins but have no significant similarity to known sequences were examined by transmission electron microscopy. Some of these self-assembled into structures strongly resembling virion structures. Thus, our ANNs are new tools for identifying phage and potential prophage structural proteins that are difficult or impossible to detect by other bioinformatic analyses. The networks will be valuable when sequence data are available but in vitro propagation of the phage is impractical or impossible.
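The two feature types named in the abstract, amino acid frequency and an estimated isoelectric point, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the pKa table, and the bisection-based pI estimate are assumptions chosen for illustration, and published pKa sets differ from the values used here.

```python
# Illustrative sketch only. Function names, pKa values, and the pI procedure
# are assumptions, not the published implementation.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Assumed pKa values for ionizable groups (published sets differ slightly).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERMINUS = 8.6
PKA_C_TERMINUS = 3.6


def composition(seq):
    """Return percent composition over the 20 standard amino acids.

    Non-standard symbols (e.g. X) are ignored in both counts and total.
    """
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * c / total for c in counts]


def net_charge(seq, ph):
    """Approximate net charge at a given pH (Henderson-Hasselbalch)."""
    seq = seq.upper()
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for aa, pka in PKA_POSITIVE.items():
        positive += seq.count(aa) / (1.0 + 10 ** (ph - pka))
    for aa, pka in PKA_NEGATIVE.items():
        negative += seq.count(aa) / (1.0 + 10 ** (pka - ph))
    return positive - negative


def estimate_isoelectric_point(seq, tol=1e-3):
    """Bisect on pH in [0, 14] for the point where net charge crosses zero."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


def feature_vector(seq):
    """20 composition percentages followed by the estimated pI (21 values)."""
    return composition(seq) + [estimate_isoelectric_point(seq)]
```

A vector of this kind could serve as the input layer of a classifier; how the original networks scaled or ordered their inputs is not stated in the abstract.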

Highlights

  • As modern sequencing technologies exponentially increase the amount of DNA sequence data available, sequences that encode proteins with unknown functions continue to accumulate

  • We chose to represent protein sequences by amino acid percent composition because Artificial Neural Networks (ANNs) trained on other encodings, such as the hydropathy index of individual amino acids, were not as successful [44]

  • A total of 160 ANNs were used for voting; the optimum values of training error, specificity, and sensitivity were assessed using the best voting scheme against a curated test set of phage sequences that we manually labeled as structural or non-structural proteins
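The voting and evaluation step mentioned in the highlights can be sketched as follows. The source does not describe the voting scheme in this excerpt, so the simple vote-fraction threshold below is an assumption, as are the function names and the toy data; the sketch only shows how ensemble votes can be combined and then scored for sensitivity and specificity against manually labeled sequences.

```python
# Illustrative sketch only. The threshold rule and all names are assumptions;
# the excerpt does not specify which voting scheme the authors selected.

def majority_vote(votes, min_fraction=0.5):
    """Combine 0/1 votes from ensemble members into a single 0/1 call.

    Returns 1 (structural) when the fraction of positive votes exceeds
    min_fraction, otherwise 0 (non-structural).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return 1 if sum(votes) / len(votes) > min_fraction else 0


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 4 test sequences, each scored by a 3-member ensemble
# (the paper's ensemble has 160 members; 3 keeps the example readable).
ensemble_votes = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
true_labels = [1, 0, 1, 1]  # hypothetical manual labels

predicted = [majority_vote(v) for v in ensemble_votes]
sens, spec = sensitivity_specificity(true_labels, predicted)
print(predicted, sens, spec)
```

Raising `min_fraction` trades sensitivity for specificity, which is one reason to compare several voting schemes on a held-out labeled set before fixing one.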

Summary

Introduction

As modern sequencing technologies exponentially increase the amount of DNA sequence data available, sequences that encode proteins with unknown functions continue to accumulate. A large majority of microbial and viral metagenome sequences sampled from different environments have no known function when judged by similarity to known sequences [1,2,3,4]. The remarkable biodiversity of viruses, together with the fact that sampling and in-depth genetic and biochemical studies of protein function were, until relatively recently, biased toward biomedically important or model organisms, limits the utility of similarity-based annotation methods. Viral diversity is partly driven by viral structural protein genes, such as those encoding tails and tail fibers, which participate directly in the evolutionary contest between viruses and their hosts. Discovering the functions of unknown viral sequences is important for understanding the lifestyle and effects of viruses in the environment, the genetic relationship between viruses and their hosts, and the influence of viruses on the development of new pathogens.
