Abstract

Modern daily-life activities produce enormous amounts of data for telecommunication systems to handle. Storing such data on digital devices and transmitting it over the Internet is challenging, which makes data compression a necessity. Research on data compression has therefore become a topic of great interest. Because compressed data is generally smaller than the original, compression saves storage and increases transmission speed. In this article, we propose a text compression technique using the GPT-2 language model and Huffman coding. In the proposed method, the Burrows-Wheeler transform and a list of keys are used to reduce the length of the original text file. Finally, we apply the GPT-2 language model and then Huffman coding for encoding. The proposed method is compared with state-of-the-art text compression techniques and demonstrates a gain in compression ratio over them.

Highlights

  • It is not easy to manage the increasing amount of data produced every day, especially in medical centers and on social media

  • Though many text compression techniques have already been developed, current technology needs a more effective text compression strategy. From this point of view, we propose a straightforward but efficient lossless text compression procedure using Generative Pre-trained Transformer 2 (GPT-2) language model and Huffman coding in this paper

  • GPT-2 uses Byte Pair Encoding, so the tokenized text contains far fewer symbols (represented as Hangul characters) than the original text, and Huffman coding performs better on a small number of symbols
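To make the Huffman-coding stage concrete, here is a minimal, self-contained sketch (not the authors' implementation) of building a Huffman code table and encoding a string with it; symbol names and structure are illustrative only:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`.

    Heap entries are (frequency, tie_breaker, {symbol: code_so_far});
    the integer tie_breaker keeps tuple comparison away from the dicts.
    """
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "abracadabra"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
```

Frequent symbols receive short codewords, so a text reduced to a small symbol alphabet (as after BPE tokenization) compresses well under this scheme.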

Introduction

It is not easy to manage the increasing amount of data produced every day, especially in medical centers and on social media. The Burrows-Wheeler transform (BWT), Huffman coding, LZW (Lempel-Ziv-Welch), LZMA, Gzip, Bzip, and Deflate are among the most popular text compression algorithms [10,11]. Gzip is a text compression algorithm based on LZ77 and Huffman coding and, as reported in [16,17], provides faster compression than Deflate. Rahman et al. present a text compression technique based on the Burrows-Wheeler transform, pattern matching, and Huffman coding in [18] and claim better compression than Deflate, Bzip, Gzip, LZMA, and LZW. Though many text compression techniques have already been developed, current technology needs a more effective text compression strategy. From this point of view, we propose a straightforward but efficient lossless text compression procedure using the GPT-2 language model and Huffman coding in this paper.
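The Burrows-Wheeler transform mentioned above rearranges a string so that equal characters tend to cluster, which helps downstream entropy coders. A minimal sketch (an O(n² log n) illustration, not a production implementation; the end-of-string sentinel `"\x03"` is an assumption of this example):

```python
def bwt(s, eos="\x03"):
    """Burrows-Wheeler transform: append a sentinel, sort all
    rotations of the string, and take the last column."""
    s = s + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, eos="\x03"):
    """Invert the BWT by repeatedly prepending the last column
    to the sorted table until full rotations are rebuilt."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]
```

For example, `bwt("banana")` groups the repeated `a`s and `n`s into runs, and `inverse_bwt` recovers the original text exactly, which is why BWT is a lossless preprocessing step.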

Background studies
Proposed method
Experimental results and analysis
Conclusions