Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity.

Youde Ding, Jianfeng Ma, Ji He, Guiying Zhang, Jing Wang, Xuemei Liu, Yuan Liao, Xu Wei

doi:10.3389/fgene.2023.1213907

Abstract

Background: With the rapid development of high-throughput sequencing technology and the explosive growth of genomic data, storing, transmitting and processing massive amounts of data has become a new challenge. Achieving fast lossless compression and decompression that exploits the characteristics of the data, and thereby speeds up data transmission and processing, requires research on suitable compression algorithms.

Methods: In this paper, we propose a compression algorithm for sparse asymmetric gene mutations (CA_SAGM) based on the characteristics of sparse genomic mutation data. The data were first sorted on a row-first basis so that neighboring non-zero elements were as close as possible to each other. The data were then renumbered using the reverse Cuthill-McKee sorting technique. Finally, the data were compressed into compressed sparse row (CSR) format and stored. We analyzed and compared the results of the CA_SAGM, coordinate format (COO) and compressed sparse column format (CSC) algorithms on sparse asymmetric genomic data. Nine types of single-nucleotide variation (SNV) data and six types of copy number variation (CNV) data from the TCGA database were used as the subjects of this study. Compression and decompression time, compression and decompression rate, compression memory and compression ratio were used as evaluation metrics. The correlation between each metric and the basic characteristics of the original data was further investigated.

Results: The experimental results showed that the COO method had the shortest compression time, the fastest compression rate and the largest compression ratio, and therefore the best compression performance. CSC had the worst compression performance, and CA_SAGM fell between the two. When decompressing the data, CA_SAGM performed best, with the shortest decompression time and the fastest decompression rate, while COO performed worst. With increasing sparsity, the COO, CSC and CA_SAGM algorithms all exhibited longer compression and decompression times, lower compression and decompression rates, larger compression memory and lower compression ratios. When the sparsity was large, the compression memory and compression ratio of the three algorithms showed no differences, but the remaining metrics still differed.

Conclusion: CA_SAGM is an efficient compression algorithm that balances compression and decompression performance for sparse genomic mutation data.
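To make the storage formats named in the Methods concrete, the following is a minimal sketch of two of them: COO triples sorted row-first, then compressed into CSR arrays and expanded back losslessly. It is not the authors' CA_SAGM implementation. The function names (`to_coo`, `coo_to_csr`, `csr_to_dense`) and the toy matrix are hypothetical, and the reverse Cuthill-McKee renumbering step is left out.

```python
from typing import List, Tuple

Matrix = List[List[int]]
Triple = Tuple[int, int, int]  # (row, column, value)


def to_coo(dense: Matrix) -> List[Triple]:
    """Collect the non-zero entries of a dense matrix as COO triples."""
    return [
        (r, c, v)
        for r, row in enumerate(dense)
        for c, v in enumerate(row)
        if v != 0
    ]


def coo_to_csr(triples: List[Triple], n_rows: int):
    """Compress COO triples into CSR arrays (values, col_idx, row_ptr)."""
    # Row-first ordering: sort by row, then by column within each row.
    triples = sorted(triples)
    values = [v for _, _, v in triples]
    col_idx = [c for _, c, _ in triples]
    # Count non-zeros per row, then take a running sum to get row offsets.
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in triples:
        row_ptr[r + 1] += 1
    for r in range(n_rows):
        row_ptr[r + 1] += row_ptr[r]
    return values, col_idx, row_ptr


def csr_to_dense(values, col_idx, row_ptr, n_cols: int) -> Matrix:
    """Decompress CSR arrays back into a dense matrix (lossless)."""
    dense = []
    for r in range(len(row_ptr) - 1):
        row = [0] * n_cols
        # Entries of row r live in the half-open slice row_ptr[r]:row_ptr[r+1].
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_idx[k]] = values[k]
        dense.append(row)
    return dense


# Toy mutation matrix (hypothetical data): rows and columns are arbitrary.
dense = [
    [0, 0, 1, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 3, 0, 4],
]
values, col_idx, row_ptr = coo_to_csr(to_coo(dense), n_rows=len(dense))
restored = csr_to_dense(values, col_idx, row_ptr, n_cols=len(dense[0]))
```

In this sketch COO stores three numbers per non-zero entry, while CSR replaces the per-entry row index with one offset per row. That means decompression can walk each row's slice directly instead of scattering entries by coordinate.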

Journal: Frontiers in Genetics
Publication Date: Jun 1, 2023
License type: CC BY 4.0

