Domain-centric database to uncover structure of minimally characterized viral genomes

John C Bramley, Aaron Diantonio, Mark A Zaydman, Jeffrey D Milbrandt, William J Buchser, Alex L Yenkin

doi:10.1038/s41597-020-0536-1

Open Access

https://doi.org/10.1038/s41597-020-0536-1

Journal: Scientific Data	Publication Date: Jun 25, 2020
Citations: 2	License type: open-access

Affiliation: Washington University in St. Louis

Abstract

Protein domain-based approaches to analyzing sequence data are valuable tools for examining and exploring genomic architecture across genomes of different organisms. Here, we present a complete dataset of domains from the publicly available sequence data of 9,051 reference viral genomes. The data provided contain information such as sequence position and neighboring domains from 30,947 pHMM-identified domains from each reference viral genome. Domains were identified from viral whole-genome sequence using automated profile Hidden Markov Models (pHMM). This study also describes the framework for constructing “domain neighborhoods”, as well as the dataset representing it. These data can be used to examine shared and differing domain architectures across viral genomes, to elucidate potential functional properties of genes, and potentially to classify viruses.
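One way to picture a "domain neighborhood" is as the set of domains immediately upstream and downstream of each domain once the hits in a genome are ordered by sequence position. The following Python sketch illustrates that idea only; it is not the authors' implementation. The `DomainHit` record, the `domain_neighborhoods` function, and the window size `k` are hypothetical names and parameters introduced here for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class DomainHit(NamedTuple):
    """Hypothetical record for one pHMM-identified domain."""
    genome: str   # genome identifier
    domain: str   # domain (profile) name
    start: int    # start position in the genome
    end: int      # end position in the genome


def domain_neighborhoods(hits, k=1):
    """Return (genome, domain, start, upstream, downstream) tuples.

    `upstream` and `downstream` list up to k neighboring domain names
    on each side, after sorting each genome's hits by position.
    The window size k is an assumption, not a value from the paper.
    """
    by_genome = defaultdict(list)
    for hit in hits:
        by_genome[hit.genome].append(hit)

    neighborhoods = []
    for genome, genome_hits in by_genome.items():
        ordered = sorted(genome_hits, key=lambda h: (h.start, h.end))
        for i, hit in enumerate(ordered):
            upstream = [h.domain for h in ordered[max(0, i - k):i]]
            downstream = [h.domain for h in ordered[i + 1:i + 1 + k]]
            neighborhoods.append(
                (genome, hit.domain, hit.start, upstream, downstream)
            )
    return neighborhoods
```

Under these assumptions, two genomes that share a domain can be compared by checking whether the upstream and downstream lists for that domain overlap.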

Highlights

Advancements in sequencing technology and the construction of large, publicly available genomic databases have widely expanded the potential for comparative genomics and discovery.
Take E. coli, the best-studied bacterium, where one third of the proteome consists of proteins of unknown function.
We ask if (1) genomes can be decomposed into a series of functional building blocks that (2) do not rely on annotated genes and that (3) can be used to classify new species or genes, and if (4) protein domains can serve as these building blocks.

Summary

Background and Summary

Advancements in sequencing technology and the construction of large, publicly available genomic databases have widely expanded the potential for comparative genomics and discovery. Defined protein domains provide just such building blocks and allow the decoding of some of this ambiguity across genomes. This approach is based on the identification of viral domains using profile Hidden Markov models (pHMM) with HMMER3 (http://hmmer.org/, v3.2.11). Although VFAM and pVOG have not been updated as recently as PFAM, they include many viral-associated domains not found in PFAM. The contents of these three profile-HMM databases (PFAM, VFAM, and pVOG) form the “PFAM database” referred to throughout this manuscript. Several recent papers have leveraged groups of protein domains to elucidate function more broadly. These include a secretion resource[13], studies of bacterial pathogenesis[14,15], and the study of temperature reactive domains[16]. A domain-based approach allows for the preservation of the functional complexity within the metagenome, but with a simpler dictionary and a more complete analysis[19], which we enable with this work.
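HMMER3 search programs can write per-domain hits to a whitespace-delimited table (the `--domtblout` option), which is one plausible starting point for collecting domain names and positions. The Python sketch below reads such a table and keeps hits under an E-value cutoff. It is an illustration under stated assumptions rather than the pipeline used in the paper: the `DomainRow` record, the `parse_domtblout` function, and the `1e-5` cutoff are hypothetical, and the column positions assume the standard 22-field domain table layout followed by a free-text description.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainRow(NamedTuple):
    """Hypothetical record holding a subset of domain-table columns."""
    target: str      # target name (the profile, when produced by hmmscan)
    query: str       # query name (the searched sequence, for hmmscan)
    i_evalue: float  # independent E-value of this domain
    ali_from: int    # alignment start on the sequence
    ali_to: int      # alignment end on the sequence


def parse_domtblout(lines: Iterable[str],
                    max_i_evalue: float = 1e-5) -> Iterator[DomainRow]:
    """Yield domain hits whose independent E-value passes the cutoff.

    The cutoff is an illustrative assumption, not a value from the paper.
    """
    for line in lines:
        # Skip blank lines and the '#' header/footer comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed fields, then a description that may contain spaces.
        fields = line.split(maxsplit=22)
        if len(fields) < 22:
            continue
        i_evalue = float(fields[12])
        if i_evalue > max_i_evalue:
            continue
        yield DomainRow(
            target=fields[0],
            query=fields[3],
            i_evalue=i_evalue,
            ali_from=int(fields[17]),
            ali_to=int(fields[18]),
        )
```

A typical use would be `rows = list(parse_domtblout(open(path)))`, where `path` points to a domain table produced by a HMMER3 search.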

Methods

Objective
