An Adaptive BWT-HMM-based Lossless Compression System for Genomic Data

I Gede Eka Sulistyawan, Muhammad Hilman Fatoni, Achmad Arifin

doi:10.1109/cenim51130.2020.9297871

Abstract

For many years, the Burrows-Wheeler Transform (BWT) has been employed in data compression. BWT-based compression, however, suffers from inflexibility because its performance is text-dependent. To address this problem, we combine BWT with a Hidden Markov Model (HMM) in a single compression system. BWT is employed to produce a structure of clustered single characters, while the HMM is employed to predict the genomic data through those clusters. We apply a learning algorithm (the Baum-Welch EM algorithm) to improve the compression ratio by re-estimating the model on the genomic data. The highest single and mean compression ratios obtained are 4.276 and 4.004, respectively, with a possible further improvement in compression ratio of as much as 2.90% before saturation. The system remains open to further development, e.g. extending the HMM to cope with complex patterns and performing offline re-estimation to reduce time consumption.
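
To illustrate the clustering effect that the abstract attributes to BWT, the following is a minimal sketch of the textbook transform applied to a short nucleotide string. It is not the authors' implementation: the function names, the `$` sentinel, and the naive rotation-sorting approach are assumptions chosen for readability, and a practical system would use a suffix-array construction instead.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform (illustrative sketch only).

    Assumes `sentinel` does not occur in `text` and sorts before
    every other symbol, which holds for '$' against A, C, G, T.
    """
    if sentinel in text:
        raise ValueError("sentinel must not appear in the input")
    s = text + sentinel
    # Build every cyclic rotation, sort them, and keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    sequence = "GATTACAGATTACAGATTACA"  # hypothetical sample, not from the paper
    transformed = bwt(sequence)
    print(transformed)
    print(inverse_bwt(transformed) == sequence)
```

In the transformed string, identical nucleotides tend to appear in runs, because rotations that share a prefix are sorted next to each other. This is the kind of clustered single-character structure that a predictive model such as an HMM could then exploit; how the paper actually couples the two stages is described in the full text.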

Full Text