Abstract
The Prediction by Partial Matching (PPM) compression algorithm is considered one of the most efficient methods for compressing natural language text. Although PPM has advanced considerably in predicting upcoming symbols or words for English text, more research is required to devise better compression methods for other languages such as Arabic, owing, for example, to the rich morphology of Arabic text, where a word can take many different forms. In this paper, we propose a new method that achieves the best compression rates not only for Arabic text but also for other languages that use Arabic script in their writing system, such as Persian. Our word-based method constructs a context-free grammar (CFG) for the text, and this grammar is then encoded using PPM to achieve excellent compression rates.
Highlights
The Prediction by Partial Matching (PPM) compression algorithm is one of the most effective kinds of statistical compression
We discuss the encoding execution times for GRW-PPM with and without using the full exclusions mechanism that PPM uses for its encoding
The GRW-PPM encoding is divided into four parts
Summary
The Prediction by Partial Matching (PPM) compression algorithm is one of the most effective kinds of statistical compression. Prediction in PPM depends on a bounded number of previous characters or symbols, effectively using a Markov-based approach. Despite its cost in memory and execution speed, PPM usually attains better compression rates than other well-known compression methods. An escape probability estimates the likelihood that a symbol not previously seen in the current context will appear [1], [2]; if an escape is encoded, the algorithm backs off to a lower-order model. The "full exclusions" mechanism [1] significantly improves compression: once an escape has occurred, the symbols already predicted by the higher-order context are excluded from the lower-order prediction, since none of them was the symbol being encoded [17]. Experimental results show that not using full exclusions speeds up program execution but reduces compression.
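The escape and full-exclusions mechanisms described above can be illustrated with a small sketch. It is not the GRW-PPM implementation: it has no grammar step and works on characters only. It assumes a PPMC-style escape estimate (escape count equal to the number of distinct symbols in the context) and a uniform order -1 model over a 256-symbol alphabet. The function name `ppm_cost_bits` and its parameters are hypothetical. Instead of producing a bitstream, it returns the ideal code length in bits that an arithmetic coder would approach.

```python
import math
from collections import defaultdict


def ppm_cost_bits(text, max_order=2, exclusions=True, alphabet_size=256):
    """Ideal code length (in bits) of `text` under a toy adaptive PPM model.

    Assumptions (not taken from the paper): PPMC-style escape counts,
    counts updated in every order after each symbol, uniform order -1 model.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    total_bits = 0.0
    for i, sym in enumerate(text):
        excluded = set()  # symbols ruled out by earlier escapes
        coded = False
        # Try the longest available context first, then back off.
        for order in range(min(max_order, i), -1, -1):
            ctx = text[i - order:i]
            seen = counts.get(ctx)
            if not seen:
                continue  # unseen context: the decoder skips it too
            live = {s: c for s, c in seen.items() if s not in excluded}
            if not live:
                continue  # every prediction here is already excluded
            # Escape gets a count equal to the number of distinct symbols.
            total = sum(live.values()) + len(live)
            if sym in live:
                total_bits -= math.log2(live[sym] / total)
                coded = True
                break
            total_bits -= math.log2(len(live) / total)  # cost of the escape
            if exclusions:
                excluded.update(live)  # these symbols cannot be the answer
        if not coded:
            # Order -1: uniform over the symbols that remain possible.
            total_bits += math.log2(alphabet_size - len(excluded))
        # Update the statistics of every context order with the new symbol.
        for order in range(min(max_order, i) + 1):
            counts[text[i - order:i]][sym] += 1
    return total_bits


if __name__ == "__main__":
    sample = "abracadabra abracadabra"
    print("with exclusions:   ", round(ppm_cost_bits(sample, exclusions=True), 2))
    print("without exclusions:", round(ppm_cost_bits(sample, exclusions=False), 2))
```

For the two-character input "ab" with `max_order=1`, the second symbol escapes from the order-0 context, which has seen only "a". Without exclusions the order -1 model spreads probability over all 256 symbols. With exclusions it spreads it over the 255 symbols other than "a", so the code length is slightly shorter. The extra bookkeeping for the excluded set is where the additional encoding time comes from.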
More From: International Journal of Advanced Computer Science and Applications