Abstract

Knowledge of the probability distribution of word and phrase frequencies is important for many practical problems, such as estimating semantic similarity between words and detecting semantic change. Many researchers assume that word frequencies obey the Poisson law. However, there is considerable evidence that the Poisson distribution describes empirical data unsatisfactorily. Analysis of the probability law is greatly complicated by the fact that frequency series are, in most cases, non-stationary. This paper discusses the distribution law of time series of word frequencies based on data from the Google Books Ngram corpus. It is shown that the relationship between the first moments of the frequencies differs from that expected under the assumption of a Poisson distribution. In particular, anomalously high values of frequency dispersion (variance) are observed for high-frequency words. To check the significance of the deviations from the Poisson law, statistical modeling of the frequency series was performed.
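The kind of check described above can be illustrated with a minimal sketch. It is not the paper's procedure and does not use Google Books Ngram data: the series are simulated, and the names `poisson_sample` and `dispersion_index` are hypothetical helpers introduced here. For Poisson counts the variance equals the mean, so the variance-to-mean ratio should be close to 1; a drifting (non-stationary) rate is one simple way such a ratio can rise above 1.

```python
import math
import random
import statistics


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 when counts are Poisson distributed."""
    return statistics.pvariance(counts) / statistics.fmean(counts)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
n = 5000

# Assumption: a stationary series with a constant rate of 4 occurrences per period.
stationary = [poisson_sample(4.0, rng) for _ in range(n)]

# Assumption: a non-stationary series whose rate drifts linearly from 2 to 6.
drifting = [poisson_sample(2.0 + 4.0 * t / (n - 1), rng) for t in range(n)]

print("stationary:", round(dispersion_index(stationary), 3))
print("drifting:  ", round(dispersion_index(drifting), 3))
```

Both simulated series have a mean rate near 4, yet the drifting one carries extra variance from the changing rate. This illustrates why non-stationarity has to be accounted for before a high dispersion is read as evidence against the Poisson law.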
