Abstract
N-grams are generalized words consisting of N consecutive symbols (letters) as they occur in a text. N-word phrases are general concepts consisting of N consecutive words, also as they occur in a text. Assuming that the rank-frequency function of single letters (i.e., one-grams) or of single words (i.e., one-word phrases) is Zipfian, we determine in this paper the exact rank-frequency function (i.e., the number of occurrences of the N-gram or N-word phrase at each rank) and the size-frequency distribution (i.e., the density of N-grams or N-word phrases at each occurrence density) of these N-grams and N-word phrases. This paper distinguishes itself from others on this topic by allowing no approximations in the calculations. This leads to an intricate rank-frequency function for N-grams and N-word phrases (as we knew from earlier unpublished calculations) but, surprisingly, to a very simple size-frequency function $f_N$ for N-grams or N-word phrases of the form $f_N(j) = \frac{F}{j^{1+1/\beta}} \ln^{N-1}\left(\frac{G}{j}\right)$, where the Zipfian distribution of single letters or words is proportional to $1/r^{\beta}$. The paper closes with the calculation of the type/token averages $\mu_N$ and the type/token-taken averages $\mu^*_N$ for N-grams and N-word phrases, where we verify the theoretically proved result $\mu^*_N \ge \mu_N$ and also give estimates for the differences $\mu^*_N - \mu_N$.
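As a rough empirical illustration of the quantities named in the abstract, the Python sketch below counts N-grams in a toy string, builds the rank-frequency list and the size-frequency table, and computes the two averages. It does not reproduce the paper's exact derivation. The function names and the sample text are hypothetical, and the formulas used for the averages are assumptions not stated in the abstract: the type/token average is taken as tokens divided by types, and the type/token-taken average as the sum of j² f(j) divided by the sum of j f(j), where f(j) is the number of distinct N-grams occurring exactly j times.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def rank_frequency(counts):
    """Occurrence counts in decreasing order; index r-1 holds the count at rank r."""
    return sorted(counts.values(), reverse=True)


def size_frequency(counts):
    """Map j -> number of distinct n-grams that occur exactly j times."""
    return Counter(counts.values())


def type_token_averages(counts):
    """Return (mu, mu_star) under the definitions assumed in the lead-in."""
    f = size_frequency(counts)
    types = sum(f.values())                      # number of distinct n-grams
    tokens = sum(j * fj for j, fj in f.items())  # total n-gram occurrences
    mu = tokens / types                          # average occurrences per type
    # Same average, but weighting each type by its number of occurrences.
    mu_star = sum(j * j * fj for j, fj in f.items()) / tokens
    return mu, mu_star


# Hypothetical toy text, used only to exercise the functions.
sample = "the cat sat on the mat and the cat ate the rat"
counts = ngram_counts(sample, 2)
mu, mu_star = type_token_averages(counts)
print(rank_frequency(counts)[:5], mu, mu_star)
```

Under these assumed definitions the type/token-taken average can never fall below the type/token average, since their difference equals the variance of the occurrence counts divided by their mean. Word-based N-word phrases can be handled the same way by passing a list of words and joining slices instead of slicing a string.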