Abstract

In any sequencing project, the possible depth of comparative analysis is determined largely by the amount and quality of the accompanying contextual data. The structure, content, and storage of this contextual data should be standardized to ensure consistent coverage of all sequenced entities and facilitate comparisons. The Genomic Standards Consortium (GSC) has developed the “Minimum Information about Genome/Metagenome Sequences (MIGS/MIMS)” checklist for the description of genomes and here we annotate all 30 publicly available marine bacteriophage sequences to the MIGS standard. These annotations build on existing International Nucleotide Sequence Database Collaboration (INSDC) records, and confirm, as expected that current submissions lack most MIGS fields. MIGS fields were manually curated from the literature and placed in XML format as specified by the Genomic Contextual Data Markup Language (GCDML). These “machine-readable” reports were then analyzed to highlight patterns describing this collection of genomes. Completed reports are provided in GCDML. This work represents one step towards the annotation of our complete collection of genome sequences and shows the utility of capturing richer metadata along with raw sequences.

Highlights

  • Researchers interested in marine viruses have long acknowledged the need to link genomic data to both biogeochemical contextual data and host sequence data in order to maximally investigate marine virus-host systems [1]

  • A comparison of International Nucleotide Sequence Database Collaboration (INSDC) reports and manually curated MIGS-compliant Genomic Contextual Data Markup Language (GCDML) reports Surveying the literature and the public databases identified a set of 27 phages isolated from a ‘marine’

  • Overall, when the minimum required resolution of the field “date” is 'year', only 21% of the components recommended by the MIGS checklist are reported in the current marine phage INSDC reports (Figure 3)

Read more

Summary

Introduction

Researchers interested in marine viruses have long acknowledged the need to link genomic data to both biogeochemical contextual data and host sequence data in order to maximally investigate marine virus-host systems [1]. The power to gain knowledge from any genomic venture depends heavily on the a priori sequence content of public databases with which to compare new sequences to, by sequence alignment approaches [7]. Sequences can be grouped by contextual parameters and interpreted in a comparative context only when these data are available and stored in an accurate, structured and accessible fashion. This allows for interpretation in light of other organisms (or communities), including habitat, isolation location, biological features, the molecular procedures applied to obtain genomic material, sequencing and post-sequencing methods. Given the vast number of sequences already available, these contextual descriptors are becoming as valuable as the nucleotides that make up the sequences. The descriptors expand the number of dimensions available in

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call