A negative storage model for precise but compact storage of genetic variation data.

Jamie K Teer, Guillermo Gonzalez-Calderon, Rodrigo Carvajal, Ruizheng Liu

doi:10.1093/database/baz158

Abstract

Falling sequencing costs and large initiatives are resulting in increasing amounts of data available for investigator use. However, there are informatics challenges in being able to access genomic data. Performance and storage are well-appreciated issues, but precision is critical for meaningful analysis and interpretation of genomic data. There is an inherent accuracy vs. performance trade-off with existing solutions. The most common approach (Variant-only Storage Model, VOSM) stores only variant data. Systems must therefore assume that everything not variant is reference, sacrificing precision and potentially accuracy. A more complete model (Full Storage Model, FSM) would store the state of every base (variant, reference and missing) in the genome thereby sacrificing performance. A compressed variation of the FSM can store the state of contiguous regions of the genome as blocks (Block Storage Model, BLSM), much like the file-based gVCF model. We propose a novel approach by which this state is encoded such that both performance and accuracy are maintained. The Negative Storage Model (NSM) can store and retrieve precise genomic state from different sequencing sources, including clinical and whole exome sequencing panels. Reduced storage requirements are achieved by storing only the variant and missing states and inferring the reference state. We evaluate the performance characteristics of FSM, BLSM and NSM and demonstrate dramatic improvements in storage and performance using the NSM approach.
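The core idea of the NSM, storing only the variant and missing states and inferring reference for everything else, can be illustrated with a minimal sketch. This is not the authors' implementation. The class name `NegativeStore`, its methods, and the in-memory layout are hypothetical and chosen only for illustration, and the sketch assumes that missing intervals are inclusive and do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class NegativeStore:
    """Hypothetical per-sample store that keeps only variant and missing states."""

    # (chrom, pos) -> alternate allele
    variants: dict = field(default_factory=dict)
    # chrom -> sorted list of inclusive, non-overlapping (start, end) intervals
    missing: dict = field(default_factory=dict)

    def add_variant(self, chrom, pos, alt):
        self.variants[(chrom, pos)] = alt

    def add_missing(self, chrom, start, end):
        intervals = self.missing.setdefault(chrom, [])
        intervals.append((start, end))
        intervals.sort()  # keep sorted so lookups can use binary search

    def state(self, chrom, pos):
        """Return 'variant', 'missing', or the inferred 'reference'."""
        if (chrom, pos) in self.variants:
            return "variant"
        intervals = self.missing.get(chrom, [])
        # Index of the last interval whose start is <= pos.
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            return "missing"
        # Nothing stored for this position, so it is inferred to be reference.
        return "reference"


store = NegativeStore()
store.add_variant("chr1", 100, "T")
store.add_missing("chr1", 200, 300)

for pos in (100, 150, 250):
    print(pos, store.state("chr1", pos))
```

Unlike a variant-only model, a position with no stored record is reported as reference only because missing regions are recorded explicitly. This is what lets the reference state be inferred without being stored.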

Highlights

Parallel sequencing results represent the latest emergence of ‘big data’ in the life sciences
From studies of human genetic variation and tumor mutations, we know that the majority of positions in the genome are invariant between individuals
We have proposed a novel model for the precise storage of genetic variation data

Summary

Introduction

Parallel sequencing results represent the latest emergence of ‘big data’ in the life sciences. As more samples are sequenced, database technologies are increasingly being used to store genetic data. In addition to classic SQL models, new database philosophies have been developed that focus on storing larger amounts of data, in part by dispensing with some of the strict rules of data organization and de-duplication (‘normalization’) that characterize SQL databases. Such systems, often termed ‘NoSQL’ databases, have proven their worth through extensive use in the internet content storage systems behind popular social media products. The addition of genomic variation data allows for integrated queries that enable improved understanding of the relationship between genotype and phenotype.

Methods

Results

Conclusion