Generation of a Model for Grapheme Frequencies and its Refinement and Validation by Group Theoretic Aspects

Hemlata Pande, H S Dhami

doi:10.1080/09296170903211485

Abstract

The occurrences of graphemes in a text are generally described by Zipf's law. In an attempt to develop a theoretical model for grapheme frequencies, Grzybek and Kelih tested different distribution models and concluded that the rank frequency distribution for Slavic languages can be expressed in the form of the negative hypergeometric distribution. Applying this distribution to different corpora has led us to derive a functional relationship between the ranks and the letters of the English alphabet, which forms the platform for the present study. In order to identify the patterns of letters in the corpus, we have applied group theoretic aspects and have observed that rings are generated corresponding to ranks 1 and 2 and to ranks in the range 23–26, and fields corresponding to ranks in the ranges 3–9 and 10–22. Applications of these rings and fields reveal that the frequency distribution can always be fitted by adopting an equation locally within each of these sets. This has led us to a general model for the rank frequency distribution of English texts.
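As a rough illustration of the distribution named above, the following Python sketch evaluates a negative hypergeometric rank frequency distribution over 26 ranks. It assumes the 1-displaced parameterization commonly used in quantitative linguistics, P(x) = C(M+x-2, x-1) C(K-M+n-x, n-x+1) / C(K+n-1, n) for x = 1, ..., n+1 with real K > M > 0; the abstract does not state which parameterization the authors use. The function names and the values K = 3.0 and M = 0.8 are hypothetical and are not fitted to any corpus.

```python
from math import exp, lgamma


def log_gen_binom(a: float, k: int) -> float:
    """Log of the generalized binomial coefficient C(a, k) for real a, via gamma functions."""
    return lgamma(a + 1) - lgamma(k + 1) - lgamma(a - k + 1)


def neg_hypergeom_rank_probs(n_ranks: int, K: float, M: float) -> list[float]:
    """Probabilities for ranks 1..n_ranks under the assumed 1-displaced
    negative hypergeometric distribution (requires K > M > 0)."""
    if not (K > M > 0):
        raise ValueError("parameters must satisfy K > M > 0")
    n = n_ranks - 1  # the support is x = 1, ..., n + 1
    log_norm = log_gen_binom(K + n - 1, n)
    probs = []
    for x in range(1, n_ranks + 1):
        log_p = (
            log_gen_binom(M + x - 2, x - 1)
            + log_gen_binom(K - M + n - x, n - x + 1)
            - log_norm
        )
        probs.append(exp(log_p))
    return probs


# Illustrative (not fitted) parameters for a 26-letter alphabet.
probs = neg_hypergeom_rank_probs(26, K=3.0, M=0.8)
for rank, p in enumerate(probs[:5], start=1):
    print(f"rank {rank}: {p:.4f}")
```

Multiplying each probability by the total number of letters in a text gives expected frequencies per rank, which could then be compared with observed counts when estimating K and M.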

Full Text