Abstract

ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from the four sources; but ViruSurf contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described from their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes the sequence in terms of its annotations and variants. The web interface enables expressing complex search queries in a simple way; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.

Highlights

  • The pandemic outbreak of the coronavirus disease COVID19, caused by the virus species SARS-CoV-2, has created unprecedented attention toward the genetic mechanisms of viruses

  • All tables have a numerical sequential primary key (PK), conventionally named using the table name and the postfix ‘ id’, and indicated as PK in Figure 1; we indicate with foreign keys (FK) the relationships from a non-key attribute to a primary key attribute of a different table

  • The web interface of ViruSurf is composed of four sections, numbered in Figure 4: [1] the menu bar, for accessing services, documentation and query utilities; [2] the search interface over metadata attributes; [3] the search interface over annotations and nucleotide/amino acid variants; [4] the result visualization section, showing resulting sequences with their metadata

Read more

Summary

Introduction

The pandemic outbreak of the coronavirus disease COVID19, caused by the virus species SARS-CoV-2, has created unprecedented attention toward the genetic mechanisms of viruses. The sudden outbreak has shown that the research community is generally unprepared to face pandemic crises in a number of aspects, including well-organized databases and search systems. We respond to such urgent need by means of a novel integrated database and search system collecting and curating virus sequences with their properties. We are driven by the Viral Conceptual Model (VCM) for virus sequences [1], which was recently developed by interviewing a variety of experts of the various aspects of virus research (including clinicians, epidemiologists, drug and vaccine developers). Variants are extracted by performing data analysis and include both nucleotide variants––with respect to the reference sequence for the specific species––with their impact, and amino acid variants related to the genes

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.