Abstract

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, raising two challenges of increasing importance: efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, the number of compressors in the literature that are specific to protein sequences is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models for the highest context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of computation that is three times slower. AC2 also uses about seven times less memory than AC, and its memory requirements are not affected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show the highest similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.

Highlights

  • One of the most demanding challenges in data compression is related to the lossless compression of protein sequences

  • The use of substitution-tolerant models in biological sequences is crucial because they provide a solid improvement factor over high-ratio general-purpose data compressors and are models that can be considered of a specifically biological nature [37,38]

  • As an example of identifying similar protein sequences in terms of quantity of information, we searched the whole UniProt database for the protein sequences most similar to the proteins of the human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [69], respecting important bioinformatics guidelines [70]
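
To illustrate what comparing sequences "in terms of quantity of information" can look like, the following is a minimal sketch of a compression-based similarity measure. It is not the measure computed by AC2: it swaps in the normalized compression distance (NCD) with the general-purpose zlib compressor as a stand-in, and the sequences are synthetic random strings rather than UniProt entries. The names `compressed_size`, `ncd`, `query`, `relative`, and `unrelated` are hypothetical.

```python
import random
import zlib

# The twenty standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (stand-in for a protein compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower values mean more shared information."""
    cx = compressed_size(x.encode())
    cy = compressed_size(y.encode())
    cxy = compressed_size((x + y).encode())
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic data (assumption): a random query, a copy with roughly 5% of
# positions resampled, and an independent random sequence.
rng = random.Random(0)
query = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))
relative = "".join(
    rng.choice(AMINO_ACIDS) if rng.random() < 0.05 else aa for aa in query
)
unrelated = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))

print("query vs relative :", round(ncd(query, relative), 3))
print("query vs unrelated:", round(ncd(query, unrelated), 3))
```

Ranking database entries by such a distance to a query sequence gives a list of the "most similar" sequences. A compressor specialized for proteins would be expected to capture similarities that a general-purpose compressor misses.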

Introduction

One of the most demanding challenges in data compression is the lossless compression of protein (or amino acid) sequences. These sequences originate in the gene expression process, which goes from DNA to RNA to a functional product: a protein. The first phase is transcription, where the information in every cell’s DNA, possibly noncontiguous, is converted into small, portable RNA messages. The second phase is translation, where each triplet of RNA is encoded into one of the twenty possible amino acids. It is essential to remember that different triplets can encode the same amino acid, so translation is a lossy encoding process. A specific chain or set of chains of amino acids establishes a protein.
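
The lossy nature of translation can be made concrete with a short sketch. It builds the standard genetic code table and translates two RNA messages that differ in one base yet yield the same amino acid chain. The function name `translate` and the example messages are hypothetical, and real translation involves details (start-site selection, alternative genetic codes) that are ignored here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in UCAG order for each of
# the three positions. "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate an RNA message triplet by triplet, stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Two different RNA messages (GCU vs GCC both encode alanine) ...
first = translate("AUGGCUUAA")
second = translate("AUGGCCUAA")
# ... produce the same protein, so the RNA cannot be recovered from it.
print(first, second, first == second)

# 61 sense codons map onto only 20 amino acids.
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
print(len(sense_codons), len(set(AMINO_ACIDS) - {"*"}))
```

Because the mapping is many-to-one, a protein sequence carries less information than the RNA it came from, and a protein compressor works with a twenty-symbol alphabet rather than the four-symbol nucleotide alphabet.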

