Abstract

A simpler model is proposed for estimating the frequency of any same-frequency words and for identifying the boundary point between high-frequency and low-frequency words in a text. The model, based on a maximum ranking method, assigns ranks to the words and estimates word frequency by the formula $\mathrm{Int}\left[\left(-1 + \sqrt{1 + 4D/I_{n+1}}\right)/2\right] > n^{*} \ge \mathrm{Int}\left[\left(-1 + \sqrt{1 + 4D/I_{n}}\right)/2\right]$. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^{*} = \sqrt{D}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words depend only on the vocabulary of a text (the number of different words) and not on its length. Like Zipf's Law, the model may be universally applicable.
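
The two formulas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that D is the number of different words and that I_n is the number of different words occurring exactly n times; the abstract does not define I_n, so that reading is an assumption. The function names (`vocabulary_stats`, `estimate_frequency`, `boundary_frequency`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def vocabulary_stats(tokens):
    """Return (D, I) for a list of word tokens.

    D is the number of different words. I maps a frequency n to the number
    of different words occurring exactly n times (assumed meaning of I_n).
    """
    word_counts = Counter(tokens)
    same_frequency_counts = Counter(word_counts.values())
    return len(word_counts), same_frequency_counts


def estimate_frequency(num_different_words, same_frequency_count):
    """Evaluate Int[(-1 + sqrt(1 + 4D/I)) / 2] for a given D and I.

    The abstract uses this expression twice, once with I_n and once with
    I_{n+1}, to bracket the estimated frequency n*.
    """
    ratio = 4 * num_different_words / same_frequency_count
    return int((-1 + math.sqrt(1 + ratio)) / 2)


def boundary_frequency(num_different_words):
    """Boundary between high- and low-frequency words: n* = sqrt(D)."""
    return math.sqrt(num_different_words)
```

For example, `vocabulary_stats("a b a c b a".split())` gives D = 3 with one word at each of the frequencies 1, 2 and 3, and `boundary_frequency(100)` places the boundary for a text with 100 different words at a frequency of 10.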
