A Syllable-Based Technique for Uyghur Text Compression

Wayit Abliz, Hao Wu, Tuergen Yibulayin, Aishan Wumaier, Jiamila Wushouer, Maihemuti Maimaiti, Kahaerjiang Abiderexiti

doi:10.3390/info11030172

Abstract

To improve the utilization of text storage resources and the efficiency of data transmission, we proposed two syllable-based Uyghur text compression coding schemes. First, according to the statistics of syllable coverage of the corpus text, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to the code tables. To enable the coding schemes to process Uyghur texts mixed with symbols from other languages, we introduced a flag code in the compression process to distinguish the Unicode encodings that were not in the code table. The experiments showed that the 12-bit coding scheme had an average compression ratio of 0.3 on Uyghur text less than 4 KB in size and that the 16-bit coding scheme had an average compression ratio of 0.5 on text less than 2 KB in size. Our compression schemes outperformed GZip, BZip2, and the LZW algorithm on short text and could be effectively applied to the compression of Uyghur short text for storage and applications.
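The table lookup with a flag code described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's actual layout: the flag value `0xFFF`, the 16-bit raw code unit that follows it, the tiny `table`, and the romanized placeholder syllables. The input is assumed to be already segmented into syllables and symbols, and characters outside the table are assumed to lie in the Basic Multilingual Plane.

```python
CODE_BITS = 12
FLAG = 0xFFF  # hypothetical escape code; the paper's real flag values are not given here


def encode(tokens, table):
    """Map pre-segmented tokens to (value, bit_width) pairs."""
    pairs = []
    for tok in tokens:
        if tok in table:
            pairs.append((table[tok], CODE_BITS))
        else:
            # Assumption: each unknown BMP character is sent as FLAG + raw 16-bit code.
            for ch in tok:
                pairs.append((FLAG, CODE_BITS))
                pairs.append((ord(ch), 16))
    return pairs


def pack(pairs):
    """Concatenate the codes into bytes; return (data, number_of_payload_bits)."""
    acc, nbits = 0, 0
    for value, width in pairs:
        acc = (acc << width) | value
        nbits += width
    pad = (-nbits) % 8  # zero-pad the tail up to a whole byte
    return (acc << pad).to_bytes((nbits + pad) // 8, "big"), nbits


def decode(data, nbits, inverse):
    """Invert pack() and encode() given the code-to-token mapping."""
    acc = int.from_bytes(data, "big") >> ((-nbits) % 8)
    pos = nbits

    def take(width):
        nonlocal pos
        pos -= width
        return (acc >> pos) & ((1 << width) - 1)

    out = []
    while pos > 0:
        code = take(CODE_BITS)
        out.append(chr(take(16)) if code == FLAG else inverse[code])
    return "".join(out)


# Hypothetical three-entry table; "Z" is deliberately left out to trigger the flag.
table = {"bi": 0, "lim": 1, " ": 2}
inverse = {code: tok for tok, code in table.items()}
data, nbits = pack(encode(["bi", "lim", " ", "Z"], table))
restored = decode(data, nbits, inverse)
```

In this sketch an out-of-table character costs 28 bits instead of 16, so a scheme of this kind only pays off when most of the text is covered by the code table.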

Highlights

Network data on the internet continues to increase significantly each year
We found that the number of occurrences for the six inherent syllabic structures accounted for the majority of the syllables
According to Equation (1), the average length of a Uyghur syllable was 2.4 characters; theoretically, regardless of text size, the compression ratio was stable at about CRB12 = 12/(2.4 × 16) ≈ 0.31 and CRB16 = 16/(2.4 × 16) ≈ 0.42
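The arithmetic in the last highlight can be checked with a short sketch. The function name and parameter names are illustrative; the 16 bits per character reflects the 16 in the highlight's denominator, and 2.4 is the average syllable length stated there.

```python
def theoretical_compression_ratio(code_bits, avg_syllable_chars, bits_per_char=16):
    """Bits spent per syllable code divided by bits the syllable occupies uncompressed."""
    return code_bits / (avg_syllable_chars * bits_per_char)


cr_b12 = theoretical_compression_ratio(12, 2.4)  # 12 / 38.4
cr_b16 = theoretical_compression_ratio(16, 2.4)  # 16 / 38.4
print(f"CRB12 = {cr_b12:.2f}, CRB16 = {cr_b16:.2f}")
```

The ratio depends only on the code width and the average syllable length, which is why it does not change with text size as long as every syllable is found in the code table.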

Summary

Introduction

Network data on the internet continues to increase significantly each year. In 2018, mobile internet access traffic alone reached 71.1 billion GB in China. Text compression technology mainly employs statistics-based and dictionary-based methods. These methods operate differently, and each has distinct advantages and disadvantages depending on the specific application. Shannon–Fano coding builds its code tree top-down, which yields low coding efficiency and a long average code length, so it is rarely used in practical applications. Huffman coding encodes the sequence according to the probability of character occurrence, so that the average code length is the shortest. This method has average compression efficiency for those characters with average probability of occurrence. The LZ78 algorithm uses a dynamic dictionary to store information: it extracts character strings from the character stream, represents them by numbers, and encodes the repeated character strings.
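As an illustration of the dictionary-based approach mentioned above, the following is a minimal textbook-style sketch of LZ78, not code from the paper. The function names are illustrative. Each output pair holds the dictionary index of the longest previously seen phrase plus the next character.

```python
def lz78_encode(text):
    """Return a list of (phrase_index, next_char) pairs; index 0 is the empty phrase."""
    dictionary = {"": 0}
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending while the phrase is already known
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # learn the new phrase
            phrase = ""
    if phrase:
        # Flush a trailing known phrase as (index of its prefix, last character).
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary growth."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)


pairs = lz78_encode("abababab")
restored = lz78_decode(pairs)
```

Because the dictionary starts empty and has to be learned from the input, a scheme like this has little to work with on very short texts, which is the setting the syllable code tables target.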

Related Research

Syllables of Uyghur

Syllable Segmentation and Analysis

Selection of High-Frequency Syllables

Syllable

Syllable Coding

B12 Coding Scheme

B16 Coding Scheme

B12 Scheme Flags

Length

B16 Scheme Flags

SDB was a

Data Compression Process

Compression Ratio

Compression

Average Coding Length

Decompression of B12 Scheme

Decompression of B16 Scheme

Experimental Corpus and Comparison Methods

Compression of Text of Different Sizes

Short-Text Compression

Method

Experimental Analysis

CRbest and CRworst of the B12 Scheme

Conclusions


Journal: Information	Publication Date: Mar 23, 2020
Citations: 5	License type: CC BY 4.0


