Abstract
ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and their integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from the original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from the four sources; ViruSurf also contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described along their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes each sequence in terms of its annotations and variants. The web interface makes it simple to express complex search queries: arbitrary queries can freely combine conditions on attributes from the four dimensions and extract the matching sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.
Highlights
The pandemic outbreak of the coronavirus disease COVID-19, caused by the virus species SARS-CoV-2, has created unprecedented attention toward the genetic mechanisms of viruses
All tables have a numerical sequential primary key (PK), conventionally named using the table name and the postfix ‘_id’, and indicated as PK in Figure 1; we indicate with foreign keys (FK) the relationships from a non-key attribute to a primary key attribute of a different table
The web interface of ViruSurf is composed of four sections, numbered in Figure 4: [1] the menu bar, for accessing services, documentation and query utilities; [2] the search interface over metadata attributes; [3] the search interface over annotations and nucleotide/amino acid variants; [4] the result visualization section, showing resulting sequences with their metadata
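The key convention mentioned in the highlights (a sequential numeric primary key named after its table plus ‘_id’, and foreign keys pointing from a non-key attribute to the primary key of another table) can be illustrated with a minimal sketch. This is not the ViruSurf schema: the table names `virus` and `sequence`, the columns `species_name` and `accession_id`, and the accession value are assumptions chosen for illustration, and SQLite stands in for whatever database engine the project actually uses.

```python
import sqlite3

# Hypothetical two-table schema; names are illustrative, not taken from ViruSurf.
SCHEMA = """
CREATE TABLE virus (
    virus_id     INTEGER PRIMARY KEY,   -- PK: table name + '_id'
    species_name TEXT NOT NULL
);
CREATE TABLE sequence (
    sequence_id  INTEGER PRIMARY KEY,   -- PK: table name + '_id'
    accession_id TEXT NOT NULL,
    virus_id     INTEGER NOT NULL REFERENCES virus (virus_id)  -- FK to virus PK
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(SCHEMA)

# Insert one virus and one sequence that refers to it through the FK.
cur = conn.execute("INSERT INTO virus (species_name) VALUES (?)", ("SARS-CoV-2",))
virus_id = cur.lastrowid
conn.execute(
    "INSERT INTO sequence (accession_id, virus_id) VALUES (?, ?)",
    ("DEMO_0001", virus_id),
)


def fk_is_enforced(connection: sqlite3.Connection) -> bool:
    """Return True if inserting a sequence with an unknown virus_id is rejected."""
    try:
        connection.execute(
            "INSERT INTO sequence (accession_id, virus_id) VALUES (?, ?)",
            ("DEMO_0002", 9999),
        )
    except sqlite3.IntegrityError:
        return True
    return False


# Follow the FK with a join to recover the species of each sequence.
rows = conn.execute(
    "SELECT s.accession_id, v.species_name "
    "FROM sequence AS s JOIN virus AS v ON s.virus_id = v.virus_id"
).fetchall()
```

Naming every primary key `<table>_id` keeps join conditions predictable, since the foreign-key column in the referencing table can carry the same name as the key it points to.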
Summary
The pandemic outbreak of the coronavirus disease COVID-19, caused by the virus species SARS-CoV-2, has created unprecedented attention toward the genetic mechanisms of viruses. The sudden outbreak has shown that the research community is generally unprepared to face pandemic crises in a number of aspects, including well-organized databases and search systems. We respond to this urgent need by means of a novel integrated database and search system collecting and curating virus sequences with their properties. We are driven by the Viral Conceptual Model (VCM) for virus sequences [1], which was recently developed by interviewing a variety of experts in the various aspects of virus research (including clinicians, epidemiologists, drug and vaccine developers). Variants are extracted by performing data analysis and include both nucleotide variants (with respect to the reference sequence for the specific species), with their impact, and amino acid variants related to the genes.
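To make the notion of a nucleotide variant "with respect to the reference sequence" concrete, the following is a minimal sketch that lists single-base substitutions between a reference and a sample. It is not the ViruSurf pipeline: it assumes the two sequences are already aligned and of equal length, it ignores insertions and deletions, it does not compute variant impact or amino acid changes, and the function name `call_substitutions` and the toy sequences are hypothetical.

```python
from typing import List, NamedTuple


class Substitution(NamedTuple):
    position: int      # 1-based coordinate on the reference
    reference: str     # base in the reference sequence
    alternative: str   # base observed in the sample


def call_substitutions(reference: str, sample: str) -> List[Substitution]:
    """List single-base differences between two pre-aligned sequences.

    Assumes equal length (no indels). Positions where the sample base is
    the ambiguity code 'N' are skipped rather than reported as variants.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned and of equal length")

    variants = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        if alt_base != "N" and alt_base != ref_base:
            variants.append(Substitution(position, ref_base, alt_base))
    return variants


# Toy sequences, invented for illustration only.
toy_reference = "ATGGCCTTA"
toy_sample = "ATGACCTTG"
toy_variants = call_substitutions(toy_reference, toy_sample)
```

On the toy input, the sample differs from the reference at positions 4 (G to A) and 9 (A to G). Real genomes require an alignment step before any such comparison, which is where insertions and deletions would be detected.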