Abstract

In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of (meta-)data in Variant Call Format (VCF) files. The flexibility of the VCF specification facilitates its use as a generic interchange format across domains, but it can lead to inconsistency between files in how metadata is presented. To enable fully autonomous, machine-actionable data flow, generic elements need to be specified further. We strongly support the FAIR principles and see the need to facilitate them through technical implementation specifications as well. VCF files are an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant call data (for example, the HapMap format and the gVCF format), but none currently has the reach of VCF. In VCF, only the sites of variation are described, whereas in gVCF, all positions are listed and confidence values are also provided. For the sake of simplicity, we discuss only VCF and our recommendations for its use. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse (if any) descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields that have not been agreed upon within the community are frequently added, which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from the plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.
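To illustrate the point that VCF metadata has a defined syntax but no agreed vocabulary, the following minimal Python sketch parses the `##` meta-information lines of a header. The header text, the function name `parse_meta_lines` and the key `sampleOrigin` are made up for illustration and are not taken from the article; the sketch handles only the common `##key=value` and `##key=<ID=...,...>` forms and is not a full VCF parser.

```python
import re

# Hypothetical header for illustration. "sampleOrigin" is an invented,
# non-standard key: it parses fine, which is exactly the interoperability issue.
EXAMPLE_HEADER = """\
##fileformat=VCFv4.3
##reference=file:///path/to/assembly.fa
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##sampleOrigin=greenhouse 3
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
"""

# key=value pairs inside <...>; a value is either a quoted string or runs to the next comma.
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_lines(header_text):
    """Return (key, value) pairs for every '##' line.

    Structured values (<ID=...,...>) become dicts; all others stay plain strings.
    """
    entries = []
    for line in header_text.splitlines():
        if not line.startswith("##"):
            continue  # skips the '#CHROM' column line and any data lines
        key, _, value = line[2:].partition("=")
        if value.startswith("<") and value.endswith(">"):
            fields = {k: v.strip('"') for k, v in _FIELD.findall(value[1:-1])}
            entries.append((key, fields))
        else:
            entries.append((key, value))
    return entries


for key, value in parse_meta_lines(EXAMPLE_HEADER):
    print(key, value)
```

The invented `sampleOrigin` line is accepted just like the standard lines, so a consumer cannot tell from syntax alone what the field means or which values it may take.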

Highlights

  • As of today, there are several public repositories for genetic and genomic variation data. Most of these repositories, such as dbSNP (Sherry et al, 2001), dbGaP (Mailman et al, 2007) and dbVar (Lappalainen et al, 2013), are restricted to human data and do not include other organisms (NCBI Insights, 2017)

  • Data are checked only for a few critical points. First, the VCF file must comply with the Variant Call Format (VCF) specification (Danecek et al, 2011). Second, the genome assembly used as reference must be registered with one of the databases of the International Nucleotide Sequence Database Collaboration (INSDC) (Cochrane et al, 2011), i.e., GenBank (Benson et al, 2013), the European Nucleotide Archive (ENA) (Leinonen et al, 2011) or the DNA Data Bank of Japan (DDBJ) (Mashima et al, 2017), and an accession number must be available. Third, the VCF file must contain allele frequencies, genotype information, or both

  • In response to the points discussed previously, we propose a minimal list of metadata fields and recommend an identifier schema as well as guidelines for vocabulary and data format within a VCF file
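The three submission checks named in the second highlight can be approximated with a header-only pre-check. The Python sketch below is a rough proxy under stated assumptions, not the repository's actual validation: the function name `precheck_vcf_header` is hypothetical, the accession `GCA_000000000.0` is a placeholder, and the code only looks at what the header declares. It cannot confirm full compliance with the specification or that an assembly is really registered with an INSDC database.

```python
def precheck_vcf_header(header_lines):
    """Header-only proxies for the three checks; not a full validation."""
    meta = [line.strip() for line in header_lines if line.startswith("##")]

    # 1. The VCF specification requires the first line to declare the format version.
    declares_format = bool(meta) and meta[0].startswith("##fileformat=VCFv")

    # 2. A non-empty ##reference line is present. Whether it names a registered
    #    assembly accession cannot be verified offline.
    prefix = "##reference="
    names_reference = any(
        line.startswith(prefix) and len(line) > len(prefix) for line in meta
    )

    # 3. Allele frequencies (INFO/AF) or genotypes (FORMAT/GT) are declared.
    declares_af = any(line.startswith("##INFO=<ID=AF,") for line in meta)
    declares_gt = any(line.startswith("##FORMAT=<ID=GT,") for line in meta)

    return {
        "fileformat": declares_format,
        "reference": names_reference,
        "frequencies_or_genotypes": declares_af or declares_gt,
    }


# Illustrative header; the accession below is a placeholder, not a real assembly.
example = [
    "##fileformat=VCFv4.3",
    "##reference=GCA_000000000.0",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
]
print(precheck_vcf_header(example))
```

A file that passes all three proxies can still carry little or no descriptive metadata, which is the gap the proposed minimal list of metadata fields is meant to close.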


Summary

24 Feb 2022

1. Boas Pucker, Institute of Plant Biology & BRICS, TU Braunschweig, Braunschweig, Germany; Alenka Hafner, Penn State University, University Park, USA. Any reports and responses or comments on the article can be found at the end of the article. When descriptive metadata is provided, proprietary metadata fields that have not been agreed upon within the community are frequently added, which may limit long-term and comprehensive interoperability. We propose recommendations for supplying and encoding metadata, focusing on use cases from the plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains. Keywords: FAIR, plant, genotyping, SNP, VCF, data management, phenotyping, ELIXIR. This article is included in the ELIXIR gateway.
