Revolutionary advances in DNA sequencing technologies have fundamentally changed the nature of genomics. Today's sequencing technologies have led to an explosion in the volume of genomic data. These data are used in many applications that require long-term storage and analysis of genomic sequences. Data-specific compression algorithms can manage such large volumes effectively. Deep learning has recently achieved great success in many compression tools and is gradually being adopted for genomic sequence compression. In particular, autoencoders have been applied to dimensionality reduction, compact data representation, and generative model learning. An autoencoder can use convolutional layers to learn essential features from its input, which makes it well suited to image and series data. However, an autoencoder reconstructs its input with some loss of information. Since accuracy is critical for genomic data, compressed genomic data must be decompressed without any information loss. We introduce a new scheme to address the loss incurred in the data decompressed by the autoencoder. This paper proposes a novel algorithm called GenCoder for reference-free compression of genomic sequences. GenCoder uses a convolutional autoencoder to produce a latent code, regenerates the genomic sequence from that code, and retrieves the original data losslessly. Performance is evaluated on various genomes and benchmark datasets. The experimental results on the tested data demonstrate that the deep learning model used in the proposed compression algorithm generalizes well to genomic sequence data and achieves a compression gain of 27% over the best state-of-the-art method.
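The abstract does not specify how GenCoder recovers the information lost by the autoencoder. As a rough illustration of the general idea only, the sketch below pairs a lossy codec with a stored correction list so that decompression is exact. Everything in it is an assumption made for illustration: `lossy_encode` and `lossy_decode` are trivial stand-ins for a trained convolutional encoder and decoder, and the residual format (position, base) is hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

Residual = List[Tuple[int, str]]  # hypothetical format: (position, correct base)


def lossy_encode(seq: str, stride: int = 4) -> str:
    """Stand-in for the encoder: keep every `stride`-th base as the latent code."""
    return seq[::stride]


def lossy_decode(latent: str, length: int, stride: int = 4) -> str:
    """Stand-in for the decoder: repeat each latent symbol, then trim to length."""
    return "".join(ch * stride for ch in latent)[:length]


def compress(seq: str, stride: int = 4) -> Tuple[str, int, Residual]:
    """Return the latent code, the original length, and the corrections."""
    latent = lossy_encode(seq, stride)
    approx = lossy_decode(latent, len(seq), stride)
    # Record only the positions where the lossy reconstruction is wrong.
    residual = [(i, b) for i, (a, b) in enumerate(zip(approx, seq)) if a != b]
    return latent, len(seq), residual


def decompress(latent: str, length: int, residual: Residual, stride: int = 4) -> str:
    """Rebuild the lossy approximation, then patch it with the corrections."""
    bases = list(lossy_decode(latent, length, stride))
    for position, base in residual:
        bases[position] = base
    return "".join(bases)


if __name__ == "__main__":
    original = "ACGTTTGACCAGGGGTTAC"
    latent, length, residual = compress(original)
    restored = decompress(latent, length, residual)
    assert restored == original  # the round trip is exact
```

The better the lossy model predicts the sequence, the shorter the correction list becomes, which is why the quality of the learned model would drive the overall compression ratio in a scheme of this kind.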