Evaluation of Persian text based on Huffman data compression

Omid Jalilian, Alireza Rezvanian, Abolfazl Toroghi Haghighat

doi:10.1109/icat.2009.5348434

Abstract

With the growth of information sources on the Web in recent years, many Web servers have been dedicated to storing them. Many methods have been presented so far for storing and transforming information on the Web in the context of parallelization or processing. However, one of the challenges researchers face when extracting and retrieving data in data mining and information retrieval is the sheer amount of information that must be stored. One solution to this problem is to compress the information resources. According to published statistics, Persian is one of the oldest and most widespread languages in the world and on the Web; given its alphabet and the variety found in Persian texts, an evaluation of compression for Persian texts is useful. This paper first discusses the difficulties posed by the variety and volume of information on the Web, then introduces general aspects of Huffman compression methods and some features of the Persian language. The selection of Persian text collections is examined, and the test results are compared across experimental datasets from Persian, English and Arabic. The experimental results are given at the end of the paper.

Full Text