Abstract
Protein domain-based approaches to analyzing sequence data are valuable tools for examining and exploring genomic architecture across genomes of different organisms. Here, we present a complete dataset of domains from the publicly available sequence data of 9,051 reference viral genomes. Domains were identified from viral whole-genome sequences using automated profile hidden Markov models (pHMMs). The data provided contain information such as sequence position and neighboring domains for 30,947 pHMM-identified domains from each reference viral genome. This study also describes the framework for constructing “domain neighborhoods”, as well as the dataset representing it. These data can be used to examine shared and differing domain architectures across viral genomes, to elucidate potential functional properties of genes, and potentially to classify viruses.
Highlights
Background and Summary
Advancements in sequencing technology and the construction of large, publicly available genomic databases have widely expanded the potential for comparative genomics and discovery.
Take E. coli, the best-studied bacterium: one third of its proteome consists of proteins of unknown function.
We ask whether (1) genomes can be decomposed into a series of functional building blocks that (2) do not rely on annotated genes and (3) can be used to classify new species or genes, and whether (4) protein domains can serve as these building blocks.
Summary
Advancements in sequencing technology and the construction of large, publicly available genomic databases have widely expanded the potential for comparative genomics and discovery. Defined protein domains provide just such building blocks and allow the decoding of some of this ambiguity across genomes. This approach is based on the identification of viral domains using profile hidden Markov models (pHMMs) with HMMER3 (http://hmmer.org/, v3.2.11). Although VFAM and pVOG have not been updated as recently as PFAM, they include many viral-associated domains not found in PFAM. The contents of these three profile-HMM databases form the “PFAM database” referred to throughout this manuscript. A number of recent papers have leveraged groups of protein domains to elucidate function more broadly. These include a secretion resource[13], bacterial pathogenesis[14,15], and the study of temperature-reactive domains[16]. A domain-based approach preserves the functional complexity within the metagenome, but with a simpler dictionary and a more complete analysis[19], which we enable with this work.
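To make the idea of a “domain neighborhood” concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that domain hits have already been obtained (for example from a HMMER search) and reduced to a genome identifier, a domain name, and start and end coordinates. The `DomainHit` record, the `build_neighborhoods` function, the window size `k`, and the example identifiers are all hypothetical names chosen for illustration. The sketch also assumes that a neighborhood is simply the nearest domains on either side of a hit when hits are ordered by position within one genome; the dataset itself may define neighborhoods differently.

```python
from collections import namedtuple

# Hypothetical record for one pHMM-identified domain hit (illustrative only).
DomainHit = namedtuple("DomainHit", ["genome", "domain", "start", "end"])


def build_neighborhoods(hits, k=1):
    """Return one row per hit, listing up to k neighboring domains on each side.

    Assumption: neighbors are the adjacent hits in the same genome after
    sorting by start (then end) coordinate.
    """
    # Group hits by genome so neighbors never cross genome boundaries.
    by_genome = {}
    for hit in hits:
        by_genome.setdefault(hit.genome, []).append(hit)

    rows = []
    for genome, genome_hits in by_genome.items():
        ordered = sorted(genome_hits, key=lambda h: (h.start, h.end))
        for i, hit in enumerate(ordered):
            rows.append({
                "genome": genome,
                "domain": hit.domain,
                "start": hit.start,
                "end": hit.end,
                # Up to k domains immediately before this hit.
                "upstream": [h.domain for h in ordered[max(0, i - k):i]],
                # Up to k domains immediately after this hit.
                "downstream": [h.domain for h in ordered[i + 1:i + 1 + k]],
            })
    return rows


# Made-up example data; identifiers and coordinates are not from the dataset.
example_hits = [
    DomainHit("genome_1", "dom_B", 500, 800),
    DomainHit("genome_1", "dom_A", 10, 300),
    DomainHit("genome_1", "dom_C", 900, 1200),
    DomainHit("genome_2", "dom_A", 50, 400),
]

for row in build_neighborhoods(example_hits, k=1):
    print(row["genome"], row["domain"], row["upstream"], row["downstream"])
```

Rows produced this way can be compared across genomes, for instance by checking whether the same domain has the same upstream and downstream neighbors in two viruses, which is one simple reading of “shared and differing domain architectures”.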