Abstract

The concept of the entropy of natural languages, first introduced by Shannon [A mathematical theory of communication, Bell Syst. Tech. J. 27, 379–423 (1948)], and its significance are discussed. A review of known approaches to, and results of, previous studies of language entropy is presented. A new, improved method for evaluating both lower and upper bounds on the entropy of printed texts is developed. The method is a refinement of Shannon's prediction (guessing) method [Shannon, Prediction and entropy of printed English, Bell Syst. Tech. J. 30, 50–64 (1951)]. Evaluation of the lower bound is shown to be a classical linear programming problem. A statistical analysis of the estimation of the bounds is given, and procedures for the statistical treatment of the experimental data (including verification of statistical validity and significance) are elaborated. The method has been applied to printed Hebrew texts in a large experiment (1000 independent samples) in order to evaluate the entropy and other information-theoretic characteristics of the Hebrew language. The results demonstrate the efficiency of the new method: the gap between the upper and lower bounds on entropy is reduced by a factor of 2.25 compared with the original Shannon approach. A comparison with other languages is given, and possible applications of the method are briefly discussed.
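The classical prediction-method bounds that the abstract's refinement builds on can be sketched as follows. This is a minimal illustration of Shannon's 1951 formulas, not the paper's improved method: it assumes a vector `q` of guess-rank frequencies, where `q[i-1]` is the fraction of trials in which a subject identified the correct next letter on the i-th guess.

```python
import math

def shannon_bounds(q):
    """Shannon's (1951) bounds on entropy, in bits per character,
    from guess-rank frequencies q (q[i-1] = probability the correct
    letter is found on the i-th guess; ranks counted from 1)."""
    # Upper bound: the entropy of the guess-rank distribution.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum over ranks i of i * (q_i - q_{i+1}) * log2(i),
    # with q_{K+1} taken as 0 past the last rank.
    lower = 0.0
    for i in range(1, len(q) + 1):
        q_next = q[i] if i < len(q) else 0.0
        lower += i * (q[i - 1] - q_next) * math.log2(i)
    return lower, upper

# Example: the letter is always found on the first guess -> zero entropy.
print(shannon_bounds([1.0]))        # both bounds are 0.0
# Example: first two guesses equally likely -> both bounds equal 1 bit.
print(shannon_bounds([0.5, 0.5]))
```

The gap between these two bounds is what the paper's refinement narrows; the abstract notes that the improved lower bound is obtained by solving a linear program over the same experimental frequencies.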
