Abstract

Summary: Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrap of the tool allows users with little or no bioinformatics knowledge to do bulk sequencing read submissions. The tool is also packaged in a Docker container to ease deployment.

Availability and implementation: The CLI ENA upload tool is available at github.com/usegalaxy-eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); the Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and github.com/galaxyproject/tools-iuc/tree/master/tools/ena_upload (development); and the ENA upload Galaxy container at github.com/ELIXIR-Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785).

Highlights

  • The current COVID-19 pandemic caused by the SARS-CoV-2 virus has highlighted the importance of open and FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al, 2016) data

  • Genome sequences of SARS-CoV-2 have been available since early January 2020 (Wu et al, 2020) and have enabled, among other things, the design of PCR tests and vaccines

  • GISAID differs from the European Nucleotide Archive (ENA) and other International Nucleotide Sequence Database Collaboration (INSDC) repositories in some key aspects: access to data and data submission are only possible after application and registration; reuse of data is restricted; and only the consensus sequence of assembled genomes is accepted


Introduction

The current COVID-19 pandemic caused by the SARS-CoV-2 virus has highlighted the importance of open and FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al, 2016) data. GISAID is one of the data resources mentioned in the European guidelines for open access to COVID-19 data. It differs from ENA and other International Nucleotide Sequence Database Collaboration (INSDC) repositories in some key aspects: access to data and data submission are only possible after application and registration; reuse of data is restricted; and only the consensus sequence of assembled genomes is accepted. This has two important implications for the research community: the majority of SARS-CoV-2 genomes are not fully FAIR, and their underlying raw data remain unpublished. We believe that this discrepancy is in part caused by the technical barrier to submitting large amounts of raw reads to ENA, which requires command line knowledge and metadata in XML format, putting off many researchers and clinicians.
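To make the metadata barrier concrete, the following minimal Python sketch shows the kind of conversion a submitter would otherwise do by hand: turning a tab-separated sample table into XML. It is an illustration only, not the implementation of the tool presented here. The column names (`alias`, `title`, `taxon_id`), the function `samples_tsv_to_xml` and the simplified element layout are assumptions made for this example; a real submission must follow the schema and sample checklist that ENA publishes.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Columns mapped to dedicated XML elements (assumed names for this sketch).
CORE_COLUMNS = ("alias", "title", "taxon_id")


def samples_tsv_to_xml(tsv_text: str) -> str:
    """Convert a tab-separated sample table into a simplified sample XML string."""
    root = ET.Element("SAMPLE_SET")
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        sample = ET.SubElement(root, "SAMPLE", alias=row["alias"])
        ET.SubElement(sample, "TITLE").text = row["title"]
        name = ET.SubElement(sample, "SAMPLE_NAME")
        ET.SubElement(name, "TAXON_ID").text = row["taxon_id"]
        # Every remaining non-empty column becomes a tag/value attribute pair.
        attributes = ET.SubElement(sample, "SAMPLE_ATTRIBUTES")
        for column, value in row.items():
            if column in CORE_COLUMNS or not value:
                continue
            attribute = ET.SubElement(attributes, "SAMPLE_ATTRIBUTE")
            ET.SubElement(attribute, "TAG").text = column
            ET.SubElement(attribute, "VALUE").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-row table; 2697049 is the NCBI taxonomy ID of SARS-CoV-2.
EXAMPLE_TSV = (
    "alias\ttitle\ttaxon_id\tcollection date\n"
    "sample_1\tHypothetical sample 1\t2697049\t2020-03-01\n"
)

if __name__ == "__main__":
    print(samples_tsv_to_xml(EXAMPLE_TSV))
```

Even in this reduced form, each row of the table expands into a nested XML record, which suggests why tabular-formatted metadata lowers the entry threshold for researchers and clinicians without command line experience.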

CLI ENA upload tool
Galaxy ENA upload tool
ENA upload Galaxy container
Implementation
Conclusion