A model for estimating the occurrence of same‐frequency words and the boundary between high‐ and low‐frequency words in texts

Qinglan Sun, Charles H. Davis, Debora Shaw

doi:10.1002/(sici)1097-4571(1999)50:3<280::aid-asi11>3.3.co;2-8

Abstract

A simpler model is proposed for estimating the frequency of any same-frequency words and identifying the boundary point between high-frequency words and low-frequency words in a text. The model, based on a maximum ranking method, assigns ranks to the words and estimates word frequency by the formula: $\mathrm{Int}\left[\left(-1 + (1 + 4D/I_{n+1})^{1/2}\right)/2\right] > n^* \geq \mathrm{Int}\left[\left(-1 + (1 + 4D/I_n)^{1/2}\right)/2\right]$. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^* = D^{1/2}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words depend only on the vocabulary of a text (the number of different words), not on its length. Like Zipf's Law, the model may be universally applicable.
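The two expressions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that D is the number of different words in the text and that I_n is the number of different words sharing the same frequency n; the abstract does not define I_n, so that reading is an assumption. The function names `vocabulary_stats`, `boundary_frequency`, and `estimated_frequency` are hypothetical, and whitespace tokenization is a simplification.

```python
import math
from collections import Counter


def vocabulary_stats(tokens):
    """Return (D, same_freq) for a list of word tokens.

    D is the number of different words. same_freq maps a frequency n to
    the number of different words occurring exactly n times (assumed here
    to be the quantity written I_n in the abstract).
    """
    word_counts = Counter(tokens)
    same_freq = Counter(word_counts.values())
    return len(word_counts), dict(same_freq)


def boundary_frequency(num_different_words):
    """Boundary between high- and low-frequency words: n* = D^(1/2)."""
    return math.sqrt(num_different_words)


def estimated_frequency(num_different_words, same_freq_count):
    """Evaluate Int[(-1 + (1 + 4D/I)^(1/2)) / 2] for one value of I."""
    if same_freq_count <= 0:
        raise ValueError("same_freq_count must be positive")
    ratio = 4 * num_different_words / same_freq_count
    return int((-1 + math.sqrt(1 + ratio)) / 2)


if __name__ == "__main__":
    # Toy text with naive whitespace tokenization (illustrative only).
    tokens = "the cat saw the dog and the dog saw a bird".split()
    d, same_freq = vocabulary_stats(tokens)
    print("D =", d)
    print("boundary n* =", boundary_frequency(d))
    for n, i_n in sorted(same_freq.items()):
        print("observed n =", n, "I_n =", i_n,
              "estimate =", estimated_frequency(d, i_n))
```

The sketch evaluates only one side of the inequality for a given I; under the assumption above, the bound for an adjacent frequency would be obtained by calling `estimated_frequency` again with the corresponding count.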

Full Text