Application of signal processing for DNA sequence compression

Bonnie Ngai‐Fong Law

doi:10.1049/iet-spr.2018.5392

Abstract

Advances in high-throughput sequencing technologies have made it feasible to sequence individual genomes quickly and affordably. With the significant increase in the number of individual genomes, compression methods are needed to reduce the pressure on data storage and to enable effective data distribution and management. These compression methods can generally be divided into two classes: reference-free methods and reference-based methods. Reference-free methods exploit redundancies within the target DNA sequence to be compressed. Reference-based methods achieve compression by identifying redundancies between the target DNA sequence and other reference sequences. The latter type of method is applicable to population sequences, which are highly similar to each other and differ by only a small number of mismatches. Some of these methods can also be applied to partially similar sequences, such as chromosome sequences or sequences with an evolutionary relationship. The authors highlight recent developments in these methods. In the comparative study, the authors’ simulation results reveal that the selection of the reference sequence is a crucial factor affecting compression performance. The use of multiple reference sequences and enhancement strategies such as reference rewriting is important for achieving a large compression gain.
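The reference-based idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes the target and reference have equal length and differ only by substitutions (no insertions or deletions), and the function names `encode_against_reference` and `decode_with_reference` are hypothetical, chosen for illustration. The target is stored as a list of mismatches against the reference, so a well-chosen reference yields a shorter list than a poorly chosen one.

```python
def encode_against_reference(target, reference):
    """Return (position, base) pairs where target differs from reference.

    Assumption: both sequences have the same length and differ only by
    substitutions. Practical tools also handle insertions and deletions.
    """
    if len(target) != len(reference):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, t) for i, (t, r) in enumerate(zip(target, reference)) if t != r]


def decode_with_reference(mismatches, reference):
    """Rebuild the target by applying the stored mismatches to the reference."""
    bases = list(reference)
    for position, base in mismatches:
        bases[position] = base
    return "".join(bases)


# Toy sequences, invented for illustration only.
target = "ACGTTCGTAA"
close_reference = "ACGTACGTAC"    # differs from the target at two positions
distant_reference = "TTTTACGTAC"  # differs from the target at five positions

close_code = encode_against_reference(target, close_reference)
distant_code = encode_against_reference(target, distant_reference)

# Both encodings are lossless, but the closer reference needs fewer entries.
restored_from_close = decode_with_reference(close_code, close_reference)
restored_from_distant = decode_with_reference(distant_code, distant_reference)
```

In this toy case the closer reference needs two mismatch entries and the more distant one needs five, which mirrors, at a very small scale, the abstract's point that reference selection affects compression performance.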
