Abstract

Protein domain-based approaches to analyzing sequence data are valuable tools for exploring genomic architecture across the genomes of different organisms. Here, we present a complete dataset of domains from the publicly available sequence data of 9,051 reference viral genomes. Domains were identified from viral whole-genome sequences using automated profile Hidden Markov Models (pHMM). The data include information such as sequence position and neighboring domains for the 30,947 pHMM-identified domains from these reference viral genomes. This study also describes the framework for constructing “domain neighborhoods”, as well as the dataset representing them. These data can be used to examine shared and differing domain architectures across viral genomes, to elucidate potential functional properties of genes, and potentially to classify viruses.

Highlights

  • Advancements in sequencing technology and the construction of large, publicly available genomic databases have widely expanded the potential for comparative genomics and discovery

  • Take E. coli, the best-studied bacterium, where one third of the proteome consists of proteins of unknown function

  • We ask if (1) genomes can be decomposed into a series of functional building blocks that (2) do not rely on annotated genes and that (3) can be used to classify new species or genes, and if (4) protein domains can serve as these building blocks


Summary

Background and Summary

Advancements in sequencing technology and the construction of large, publicly available genomic databases have widely expanded the potential for comparative genomics and discovery. Defined protein domains provide just such building blocks and allow the decoding of some of this ambiguity across genomes. This approach is based on the identification of viral domains using profile Hidden Markov models (pHMM) with HMMER3 (http://hmmer.org/, v3.2.11). Although VFAM and pVOG have not been updated as recently as PFAM, they include many viral-associated domains not found in PFAM. The contents of these three profile-HMM databases form the “PFAM database” referred to throughout this manuscript. A number of recent papers have leveraged groups of protein domains to elucidate function more broadly. These include a secretion resource[13], bacterial pathogenesis[14,15], and the study of temperature-reactive domains[16]. A domain-based approach allows for the preservation of the functional complexity within the metagenome, but with a simpler dictionary and a more complete analysis[19], which we enable with this work.
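
To make the idea of a “domain neighborhood” concrete, the following is a minimal sketch in Python, not the authors' pipeline. It assumes that domain hits have already been identified (for example from pHMM search output) and that a neighborhood consists of the domains immediately upstream and downstream of each hit on the same genome; the dataset's actual definition may differ. The `DomainHit` record, the `domain_neighborhoods` function, and all genome and domain names are hypothetical and used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One domain hit on a genome (hypothetical record layout)."""
    genome: str  # genome identifier (illustrative)
    domain: str  # name of the matched profile (illustrative)
    start: int   # start coordinate of the hit on the genome
    end: int     # end coordinate of the hit on the genome


def domain_neighborhoods(hits):
    """Map each hit to the names of its (upstream, downstream) neighbors.

    Assumption: neighbors are the adjacent hits on the same genome after
    sorting by coordinate; None marks a missing neighbor at either end.
    """
    by_genome = defaultdict(list)
    for hit in hits:
        by_genome[hit.genome].append(hit)

    neighborhoods = {}
    for genome_hits in by_genome.values():
        ordered = sorted(genome_hits, key=lambda h: (h.start, h.end))
        for i, hit in enumerate(ordered):
            upstream = ordered[i - 1].domain if i > 0 else None
            downstream = ordered[i + 1].domain if i + 1 < len(ordered) else None
            neighborhoods[hit] = (upstream, downstream)
    return neighborhoods


# Toy input with made-up coordinates and names.
hits = [
    DomainHit("genome_A", "dom_polymerase", 1200, 2100),
    DomainHit("genome_A", "dom_capsid", 100, 900),
    DomainHit("genome_A", "dom_helicase", 2500, 3300),
    DomainHit("genome_B", "dom_capsid", 50, 800),
]
result = domain_neighborhoods(hits)
for hit, (up, down) in result.items():
    print(hit.genome, hit.domain, "upstream:", up, "downstream:", down)
```

Comparing the resulting (upstream, downstream) pairs for the same domain across genomes is one simple way to look for shared or differing domain architectures.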
