Abstract

Archives operating under the International Nucleotide Sequence Database Collaboration currently preserve all submitted sequences equally, but rapid increases in the rate of global sequence production will soon require differentiated treatment of DNA sequences submitted for archiving. Here, we propose a graded system in which the ease of reproduction of a sequencing-based experiment and the relative availability of a sample for resequencing define the level of lossy compression applied to stored data.

Highlights

  • Framework for archiving: simple and utilitarian thinking must be applied to archiving DNA sequence data.

  • For compression factors greater than 100, it is likely that one would require lossy behaviour on the actual sequence, i.e. error-correction of likely sequencing errors to provide a more compressible dataset. In this perspective piece, we intend both to provide a framework in which to think about future DNA sequence archiving and to provide an initial opinion, with concrete examples, to encourage appropriate debate in the community.

  • A recognised value of archiving experimental data is the opportunity to support alternative analysis and meta-analysis of the data for purposes not originally intended by the submitting scientist. This approach has yielded useful serendipitous outputs, including an assembled genome sequence from a Wolbachia species discovered as contaminant sequence in Drosophila sequencing data, and the calling of polymorphisms in the mouse genome from archived Celera traces [8,9].


Summary

Background

The vast majority of living organisms utilise nucleic acid as their primary store of genetic information. An additional property of advances in sequencing technology is that, at the current rates of change, DNA sequencing costs will fall so low as to become negligible for some applications. This will allow a far greater range of scientific experiments to be carried out, but will also allow whimsical or nonsensical uses of DNA sequencing, and will generate additional pressure on storage resources. The analogy with image-based techniques is relevant, with perhaps only the most valuable images stored in a completely lossless manner even locally, and more routine storage at variable levels under lossy compression formats. In this perspective piece, we explore the utility of different schemes for data reduction for a DNA sequence archive. We set out a framework in which to make data loss decisions, and explore the consequences of these decisions.
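
As an illustration only, the graded idea described in the abstract (ease of reproducing the experiment and availability of the sample for resequencing together set the level of lossy compression) could be sketched as a small decision function. Everything specific here is an assumption made for the sketch and not taken from the paper: the 0 to 1 scoring scale, the three tier names, the thresholds of 0.33 and 0.66, and the choice to let the weaker of the two scores decide.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical compression tiers; the names are illustrative only."""
    LOSSLESS = "lossless"
    MODERATE_LOSSY = "moderate lossy"
    AGGRESSIVE_LOSSY = "aggressive lossy"


@dataclass(frozen=True)
class Submission:
    # Both scores are hypothetical and lie in [0, 1]:
    # 0 = impossible to reproduce / sample gone, 1 = trivial / sample abundant.
    ease_of_reproduction: float
    sample_availability: float


def compression_tier(sub: Submission) -> Tier:
    """Map a submission to a tier. Thresholds are assumed, not from the paper."""
    for name, value in (
        ("ease_of_reproduction", sub.ease_of_reproduction),
        ("sample_availability", sub.sample_availability),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Assumption: data are only as replaceable as the weaker of the two
    # criteria, so an easy experiment on an unobtainable sample stays lossless.
    replaceability = min(sub.ease_of_reproduction, sub.sample_availability)

    if replaceability < 0.33:
        return Tier.LOSSLESS
    if replaceability < 0.66:
        return Tier.MODERATE_LOSSY
    return Tier.AGGRESSIVE_LOSSY
```

Under these assumed thresholds, `compression_tier(Submission(0.9, 0.1))` returns `Tier.LOSSLESS`, because the sample is hard to obtain again even though the experiment itself is easy to repeat.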

Main text
The presence of a large excess of DNA in a robust physical archive
