High density genotype storage for plant breeding in the Chado schema of Breedbase.

Nicolas Morales, Lukas A Mueller, Adrian F Powell, Isaak Y Tecle, Guillaume J Bauchet, Bryan J Ellerbrock, Titima Tantikanjana, Junwen Wang

doi:10.1371/journal.pone.0240059

Abstract

Modern breeding programs routinely use genome-wide information for selecting individuals to advance. The large volumes of genotypic information required present a challenge for data storage and query efficiency. Major use cases require genotyping data to be linked with trait phenotyping data. In contrast to phenotyping data, which are often stored in relational database schemas, next-generation genotyping data are traditionally stored in non-relational storage systems because of their extremely large size. This study presents a novel data model implemented in Breedbase (https://breedbase.org/) for uniting relational phenotyping data and non-relational genotyping data within the open-source PostgreSQL database engine. Breedbase is an open-source web database designed to manage all of a breeder’s informatics needs: management of field experiments, phenotypic and genotypic data collection and storage, and statistical analyses. The genotyping data are stored in a PostgreSQL data type known as binary JavaScript Object Notation (JSONb), where the JSON structures closely follow the Variant Call Format (VCF) data model. The Breedbase genotyping data model can handle different ploidy levels, structural variants, and any genotype encoded in VCF. JSONb is both compressed and indexed, resulting in a space- and time-efficient system. Furthermore, file caching maximizes data retrieval performance. Integration of all breeding data within the Chado database schema retains referential integrity that may be lost when genotyping and phenotyping data are stored in separate systems. Benchmarking demonstrates that the system is fast enough for computation of a genomic relationship matrix (GRM) and a genome-wide association study (GWAS) for datasets involving 1,325 diploid Zea mays, 314 triploid Musa acuminata, and 924 diploid Manihot esculenta samples genotyped with 955,690, 142,119, and 287,952 genotyping-by-sequencing (GBS) markers, respectively.

Highlights

The advent of low-cost, high-throughput genotyping platforms has made routine genotyping possible, producing enormous amounts of data but presenting challenges for data management and querying [1].
JSONb is a binary-formatted JavaScript Object Notation (JSON) field, allowing for compressed data sizes and faster queries in some scenarios.
Breeding methods such as genome-wide association studies (GWAS) and genomic selection (GS) depend on large volumes of genotypic data and metadata, generally stored in the standardized Variant Call Format (VCF) structure.
The JSON genotype storage model presented here closely follows the VCF specification and can handle any kind of variant encoded in VCF, such as different ploidy levels, multiple alleles, insertions or deletions, and structural variants.
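
As a rough illustration of what handling different ploidy levels and multiple alleles involves, the following Python sketch parses a VCF `GT` string of any ploidy and counts its non-reference alleles. The helper name `parse_gt` and the returned key names are hypothetical, chosen for this example only, and are not taken from the Breedbase code base.

```python
import re


def parse_gt(gt):
    """Parse a VCF GT string such as '0/1', '1|2' or '0/1/1'.

    Hypothetical helper; the key names are illustrative only.
    Returns the allele indices (None for a missing '.'), the ploidy,
    whether the call uses the phased separator '|', and the number of
    non-reference alleles (None if any allele is missing).
    """
    tokens = re.split(r"[/|]", gt)
    alleles = [None if t == "." else int(t) for t in tokens]
    has_missing = any(a is None for a in alleles)
    return {
        "alleles": alleles,
        "ploidy": len(alleles),
        "phased": "|" in gt,
        "alt_dosage": None if has_missing else sum(1 for a in alleles if a > 0),
    }


diploid = parse_gt("0/1")        # one alternate allele
triploid = parse_gt("0/1/1")     # three allele slots, two alternate
multiallelic = parse_gt("1|2")   # phased call with two different alternates
missing = parse_gt("./.")        # uncalled genotype
```

Because the allele list is simply as long as the number of tokens in the call, the same code path serves diploid, triploid, and multi-allelic genotypes without special cases.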

Summary

Introduction

The advent of low-cost, high-throughput genotyping platforms has made routine genotyping possible, producing enormous amounts of data but presenting challenges for data management and querying [1]. To serve breeding programs effectively and efficiently, it is critical to store germplasm, pedigrees, experimental designs, and phenotypic and genotypic information under a unified architecture. These services can either be implemented within a single database or provided by independent applications interconnected via application programming interfaces (APIs) such as the publicly specified Plant Breeding API (BrAPI) [8]. In non-relational systems, data can be stored as nested objects composed of heterogeneous keys and values, which allows flexibility in the data structure and model; such data are often structured using JavaScript Object Notation (JSON). JSONb is a binary-formatted JSON field, allowing for compressed data sizes and faster queries in some scenarios. Breeding methods such as genome-wide association studies (GWAS) and genomic selection (GS) depend on large volumes of genotypic data and metadata, generally stored in the standardized Variant Call Format (VCF) structure. The preferred format for uploading and downloading genotypic data in Breedbase is VCF.
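
To make the idea of nested JSON objects that follow the VCF data model more concrete, here is a minimal Python sketch that turns one VCF data line into a per-sample nested document. The function name `vcf_line_to_docs`, the layout of the resulting objects, the sample names, and the example marker are all assumptions made for illustration; the actual Breedbase JSON structures may differ.

```python
import json


def vcf_line_to_docs(header_line, data_line):
    """Convert one VCF data line into nested, JSON-serializable objects.

    Hypothetical layout, for illustration only:
      marker_info: {marker_name: {"chrom", "pos", "ref", "alt"}}
      docs:        {sample_name: {marker_name: {FORMAT key: value}}}
    """
    columns = header_line.lstrip("#").split("\t")
    fields = data_line.rstrip("\n").split("\t")
    samples = columns[9:]  # sample columns start after FORMAT
    chrom, pos, vid, ref, alt = fields[0:5]
    # Fall back to chrom_pos when the ID column is empty ('.').
    marker = vid if vid != "." else f"{chrom}_{pos}"
    format_keys = fields[8].split(":")
    docs = {}
    for sample, call in zip(samples, fields[9:]):
        docs[sample] = {marker: dict(zip(format_keys, call.split(":")))}
    marker_info = {
        marker: {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt.split(",")}
    }
    return marker_info, docs


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tplantA\tplantB"
line = "1\t1045\tS1_1045\tA\tG,T\t.\tPASS\t.\tGT:DP\t0/1:12\t1/2:7"
marker_info, docs = vcf_line_to_docs(header, line)
print(json.dumps(docs["plantA"], sort_keys=True))
# prints {"S1_1045": {"DP": "12", "GT": "0/1"}}
```

Keying each sample's document by marker name keeps every FORMAT field of the original call available, so no information from the VCF record has to be discarded to fit a fixed set of columns.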

Chado schema modifications

Genotype storage JSON structures

Caching of results

Example SQL queries

Packaged queries

Limitations

Performance benchmark

Scalability and continued development

Journal: PloS one
Publication Date: Nov 11, 2020
Citations: 6
License type: CC BY 4.0
