Abstract

Advances in high-throughput sequencing technologies have made it feasible to sequence individual genomes quickly and affordably. As the number of individual genomes grows significantly, compression methods are needed to reduce the pressure on data storage and to enable effective data distribution and management. These compression methods can generally be divided into two classes: reference-free methods and reference-based methods. Reference-free methods exploit redundancies within the target DNA sequence to be compressed. Reference-based methods achieve compression by identifying redundancies between the target DNA sequence and one or more reference sequences. Reference-based methods are applicable to population sequences, which are highly similar to each other and differ by only a small number of mismatches. Some of these methods can also be applied to partially similar sequences, such as chromosome sequences or sequences with an evolutionary relationship. The authors highlight recent developments in these methods. In the comparative study, the authors’ simulation results reveal that the selection of the reference sequence is a crucial factor affecting compression performance. Using multiple reference sequences and enhancement strategies such as reference rewriting is important for achieving a large compression gain.
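
The reference-based idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper or from any of the surveyed tools: it assumes the target and reference have equal length and differ only by substitutions, and the names `encode_against_reference` and `decode_with_reference` are hypothetical. Practical methods also need to handle insertions, deletions, and rearrangements.

```python
# Toy illustration of reference-based compression (an assumption-laden sketch,
# not the method of any specific tool): store only the positions where the
# target differs from the reference.

def encode_against_reference(reference: str, target: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the target differs from the reference."""
    if len(reference) != len(target):
        # Assumption of this sketch: substitutions only, so lengths must match.
        raise ValueError("this sketch handles equal-length sequences only")
    return [
        (pos, t_base)
        for pos, (r_base, t_base) in enumerate(zip(reference, target))
        if r_base != t_base
    ]

def decode_with_reference(reference: str, mismatches: list[tuple[int, str]]) -> str:
    """Rebuild the target by applying the stored substitutions to the reference."""
    bases = list(reference)
    for pos, base in mismatches:
        bases[pos] = base
    return "".join(bases)

reference = "ACGTACGTACGT"
target = "ACGTACCTACGA"

mismatches = encode_against_reference(reference, target)
restored = decode_with_reference(reference, mismatches)
print(mismatches)          # the only data stored alongside the shared reference
print(restored == target)  # True when the round trip is lossless
```

The fewer mismatches the target has relative to the chosen reference, the shorter the stored list, which is one way to see why the choice of reference affects compression performance.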
