Abstract
A simpler model is proposed for estimating the frequency of same-frequency words and identifying the boundary point between high-frequency and low-frequency words in a text. The model, based on a “maximum ranking method,” assigns ranks to the words and estimates word frequency by the formula: $\mathrm{Int}\left[\left(-1 + (1 + 4D/I_{n+1})^{1/2}\right)/2\right] > n^* \geq \mathrm{Int}\left[\left(-1 + (1 + 4D/I_n)^{1/2}\right)/2\right]$. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^* = D^{1/2}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words depend only on the vocabulary of a text (the number of different words) and not on its length. Like Zipf's Law, the model may be universally applicable.
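As a rough illustration of the two expressions in the abstract, the following Python sketch evaluates them for a given vocabulary size. The function names (`vocabulary_size`, `boundary_value`, `int_bound`) are hypothetical, and the reading of the second argument as the number of words sharing a given frequency is an assumption, since the abstract does not define it. This is not the authors' implementation.

```python
import math


def vocabulary_size(tokens):
    """Return D, the number of different words in a tokenized text."""
    return len(set(tokens))


def boundary_value(D):
    """Boundary between high- and low-frequency words: the square root of D."""
    return math.sqrt(D)


def int_bound(D, I):
    """Evaluate Int[(-1 + (1 + 4D/I)^(1/2)) / 2] for one value of I.

    Assumption: I stands for the number of words that share one frequency
    (I_n or I_{n+1} in the formula); the abstract does not define it.
    """
    if I <= 0:
        raise ValueError("I must be positive")
    # int() truncates toward zero, which matches Int[...] for non-negative values.
    return int((-1 + math.sqrt(1 + 4 * D / I)) / 2)


if __name__ == "__main__":
    tokens = "the cat and the dog and the bird".split()
    D = vocabulary_size(tokens)  # five different words
    print("D =", D)
    print("boundary =", boundary_value(D))
    # Illustrative values only, not data from the study.
    print("bound for D=100, I=5:", int_bound(100, 5))
```

Only the vocabulary size D enters the boundary calculation, which mirrors the abstract's claim that the result depends on the number of different words and not on text length.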