Learning-based short text compression using BERT models

Emir Öztürk, Altan Mesut

doi:10.7717/peerj-cs.2423

Abstract

Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios than traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective on short texts than on long ones. This study introduces MLMCompress, a word-based text compression method that can use any BERT masked language model. The performance of MLMCompress is evaluated using four BERT models: two large models and two smaller models referred to as “tiny”. The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieves 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text than NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which has been developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst-case scenario. Additionally, it demonstrates a 20% improvement in compression speed and a 180% improvement in decompression speed in the best case. Furthermore, MLMCompress outperforms traditional compression methods such as Gzip and specialized short text compression methods such as Smaz and Shoco, particularly on short texts, even when using the smaller models.
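The abstract does not spell out the encoding steps, so the following is only a minimal sketch of the general idea behind predictor-driven word coding, not the authors' implementation. It assumes a toy bigram counter in place of a BERT masked language model, assumes that only the preceding word is used as context, and uses hypothetical names throughout (`build_predictor`, `compress`, `decompress`, the `"<s>"` start marker, the top-`k` cutoff). A word that appears among the predictor's top candidates is replaced by its small rank; any other word is stored as a literal.

```python
from collections import Counter, defaultdict


def build_predictor(corpus_words):
    """Toy stand-in for a language model (assumption: bigram counts, not BERT)."""
    table = defaultdict(Counter)
    for prev, cur in zip(corpus_words, corpus_words[1:]):
        table[prev][cur] += 1

    def predict(prev, k):
        # Rank candidates by descending count, then alphabetically,
        # so that encoder and decoder always see the same order.
        counts = table.get(prev, Counter())
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, _ in ranked[:k]]

    return predict


def compress(words, predict, k=4):
    """Replace each predictable word by its rank among the top-k candidates."""
    symbols, prev = [], "<s>"  # "<s>" is a hypothetical start-of-text marker
    for word in words:
        candidates = predict(prev, k)
        if word in candidates:
            symbols.append(("rank", candidates.index(word)))
        else:
            symbols.append(("literal", word))  # escape: store the word itself
        prev = word
    return symbols


def decompress(symbols, predict, k=4):
    """Invert compress() by replaying the same predictions."""
    words, prev = [], "<s>"
    for kind, value in symbols:
        word = predict(prev, k)[value] if kind == "rank" else value
        words.append(word)
        prev = word
    return words


training = "the cat sat on the mat and the cat ate the fish".split()
predict = build_predictor(training)

message = "the cat sat on the mat".split()
symbols = compress(message, predict)
restored = decompress(symbols, predict)
print(symbols)
print(restored == message)
```

In this toy run only the first word is stored literally; every later word becomes a small integer, which an entropy coder could then store in a few bits. The better the predictor, the more ranks are zero or near zero, which is the intuition for why a stronger model yields a higher compression ratio at a higher computational cost.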


Journal: PeerJ Computer Science
Publication Date: Oct 18, 2024
License type: CC BY 4.0

