Abstract

Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models.

Findings: We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art.

Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.

Highlights

  • The DNA sequencing rate is increasing exponentially, stretching genomics storage requirements to unprecedented dimensions

  • We describe the datasets and materials used for the benchmark, followed by the comparison with GeCo2 using different characteristics, number of models, and data redundancy

  • To estimate the cost of long-term storage, we developed a model with the following simplifying assumptions: ≥2 copies are stored; compression is done once and the result is copied to the different backup media; 1 central processing unit (CPU) core is at 100% utilization during compression; the cooling and transfer costs are ignored; the computing platform is idle when not compressing; and no human operator is waiting for the operations to terminate
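
The simplifying assumptions in the last highlight can be turned into a small cost function. The sketch below is only an illustration. The formula (storage for every copy plus the energy of a single compression run on one fully used core) is an assumption made here, and the prices, power draw, and the names `CostParams` and `long_term_storage_cost` are hypothetical, not values or identifiers from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    # All defaults are hypothetical placeholders, not figures from the article.
    copies: int = 2                          # at least 2 copies are stored
    storage_price_per_gb: float = 0.03       # price per GB for the whole retention period
    core_power_watts: float = 10.0           # draw of 1 CPU core at 100% utilization
    electricity_price_per_kwh: float = 0.15  # price per kWh


def long_term_storage_cost(compressed_gb, compress_hours, params=CostParams()):
    """Estimate storage plus compression cost under the stated assumptions.

    Cooling, transfer, idle-time, and operator costs are ignored, and the
    data is compressed once and then copied to every backup medium.
    """
    if params.copies < 2:
        raise ValueError("the model assumes at least 2 stored copies")
    storage = params.copies * compressed_gb * params.storage_price_per_gb
    energy_kwh = params.core_power_watts * compress_hours / 1000.0
    compute = energy_kwh * params.electricity_price_per_kwh
    return storage + compute


# Hypothetical comparison: a slower compressor that produces a smaller file.
fast = long_term_storage_cost(compressed_gb=100.0, compress_hours=10.0)
slow = long_term_storage_cost(compressed_gb=94.0, compress_hours=30.0)
print(f"fast: {fast:.3f}  slow: {slow:.3f}")
```

Under a model of this shape, the storage term is paid for every copy while compression is paid for once, so a slower compressor can still be cheaper overall when its size reduction is large enough.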


Summary

Introduction

The DNA sequencing rate is increasing exponentially, stretching genomics storage requirements to unprecedented dimensions. Only a few recent articles propose the use of neural networks for DNA sequence compression, and they fall short when compared with specific DNA compression tools, such as GeCo2. We combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, its results outperform the state of the art. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco
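
To make the idea of mixing model probabilities with a neural network more concrete, here is a minimal sketch in pure Python. It is not GeCo3's implementation. The single hidden layer, the softmax output, the learning rate, the online update after each symbol, and the two toy models are all assumptions chosen for illustration, and `MixingNetwork` and `demo` are hypothetical names. The only property taken from the text is that the mixer receives nothing but the probabilities produced by the models.

```python
import math
import random


class MixingNetwork:
    """Tiny one-hidden-layer network that maps the probabilities of several
    models to a single mixed distribution over the symbols A, C, G, T."""

    def __init__(self, n_models, n_hidden=8, lr=0.2, seed=0):
        rng = random.Random(seed)
        self.lr = lr
        n_in = 4 * n_models
        # The extra weight in each row multiplies a constant 1.0 (bias).
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(4)]

    def predict(self, model_probs):
        # Flatten the per-model distributions and append the bias input.
        self.x = [p for probs in model_probs for p in probs] + [1.0]
        self.h = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, self.x))))
                  for row in self.w1] + [1.0]
        z = [sum(w * v for w, v in zip(row, self.h)) for row in self.w2]
        top = max(z)
        e = [math.exp(v - top) for v in z]  # numerically stable softmax
        total = sum(e)
        self.out = [v / total for v in e]
        return self.out

    def update(self, symbol):
        """One SGD step on the cross-entropy of the last prediction."""
        dz = [o - (1.0 if k == symbol else 0.0) for k, o in enumerate(self.out)]
        # Backpropagate to the hidden units before changing any weights.
        dh = [sum(dz[k] * self.w2[k][j] for k in range(4))
              * self.h[j] * (1.0 - self.h[j]) for j in range(len(self.w1))]
        for k in range(4):
            for j in range(len(self.h)):
                self.w2[k][j] -= self.lr * dz[k] * self.h[j]
        for j in range(len(self.w1)):
            for i in range(len(self.x)):
                self.w1[j][i] -= self.lr * dh[j] * self.x[i]


def demo(n_symbols=2000, seed=1):
    """Mix an uninformative toy model with an informative one and return the
    average cost in bits per symbol over the first and last 200 symbols."""
    rng = random.Random(seed)
    net = MixingNetwork(n_models=2)
    costs = []
    for _ in range(n_symbols):
        symbol = rng.randrange(4)
        uninformative = [0.25] * 4
        informative = [0.05] * 4
        informative[symbol] = 0.85  # toy stand-in for a well-fitting model
        mixed = net.predict([uninformative, informative])
        costs.append(-math.log2(mixed[symbol]))
        net.update(symbol)  # learn only after the symbol has been coded
    return sum(costs[:200]) / 200, sum(costs[-200:]) / 200


first, last = demo()
print(f"bits per symbol, first 200: {first:.3f}  last 200: {last:.3f}")
```

Because the network is updated only after each symbol has been predicted, a decoder could in principle repeat the same steps and stay synchronized with the encoder, which is what lossless compression requires. In an actual compressor the mixed distribution would drive an entropy coder such as an arithmetic coder, which this sketch omits.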

