Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts

Sven Rahmann, Eric Rivals

doi:10.1007/3-540-45123-4_31

Abstract

The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q-1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
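As an illustration of the quantities named in the abstract, here is a minimal Python sketch. It is not the authors' algorithm. It assumes a uniform i.i.d. text model over an alphabet of size `sigma`, and the function names `autocorrelation`, `exact_expected_missing`, and `approx_expected_missing` are hypothetical. The exact value is obtained by brute-force enumeration of all texts, which is feasible only for tiny parameters, whereas the paper's method is based on enumerating autocorrelations. The approximation is the direct consequence of treating the n-q+1 words of a text as independent, and may differ in form from the formulas given in the paper.

```python
from itertools import product


def autocorrelation(word):
    """Return a 0/1 list c with c[p] == 1 iff the word, shifted by p
    positions, matches itself on the overlapping part."""
    q = len(word)
    return [1 if word[p:] == word[:q - p] else 0 for p in range(q)]


def exact_expected_missing(sigma, n, q):
    """Brute-force expected NMW: average the number of absent q-words
    over all sigma**n texts of length n (uniform model, assumption)."""
    total_words = sigma ** q
    missing_sum = 0
    for text in product(range(sigma), repeat=n):
        # Collect every q-word that occurs at some position of the text.
        seen = {text[i:i + q] for i in range(n - q + 1)}
        missing_sum += total_words - len(seen)
    return missing_sum / sigma ** n


def approx_expected_missing(sigma, n, q):
    """Independence approximation: a fixed word is absent from each of
    the n-q+1 positions with probability 1 - sigma**(-q)."""
    return sigma ** q * (1 - sigma ** -q) ** (n - q + 1)


# "abab" overlaps itself at shifts 0 and 2.
print(autocorrelation("abab"))  # [1, 0, 1, 0]

# Compare exact and approximate expectations for a small binary example.
print(exact_expected_missing(2, 8, 3), approx_expected_missing(2, 8, 3))
```

For q = 1 the words do not overlap, so the two functions agree; for larger q the gap between them reflects the dependence between successive words that the abstract describes.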
