Abstract

We have developed methods, based on hashed indexing, for storing and retrieving large dictionaries of word pairs and other multi-word phrases. From analysis of text samples we have derived Zipfian laws for the frequency distributions of word pairs and longer phrases. We show where these Zipfian curves cross and deduce that the number of multi-word phrases that occur frequently in text is surprisingly small, of the same order of magnitude as the number of individual word types in a text. Dictionaries of phrases are therefore amenable to fast processing with modest computer equipment. Finally, we suggest that in stylistic analysis multi-word phrases might discriminate between authors better than single words do.
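As a rough illustration of the kind of phrase dictionary described above, the following minimal sketch counts n-word phrases in a hash table and lists their frequencies by rank, which is the form in which a Zipfian distribution is usually inspected. It is not the authors' method: the function names `count_phrases` and `rank_frequency`, the whitespace tokenisation, and the use of Python's built-in hash-based `Counter` are all assumptions made for the example.

```python
from collections import Counter


def count_phrases(text, n=2):
    """Count every run of n consecutive words in text.

    Counter is a hash table, so each phrase (stored as a tuple of words)
    is looked up and updated by hashing. Tokenisation here is a plain
    lowercase whitespace split, which is an assumption for illustration.
    """
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def rank_frequency(counts):
    """Return (rank, frequency) pairs, most frequent phrase first."""
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat ran"
    pairs = count_phrases(sample, n=2)
    for rank, freq in rank_frequency(pairs)[:3]:
        print(rank, freq)
```

Calling `count_phrases` with different values of `n` on the same text gives one rank-frequency list per phrase length, which could then be compared to see where the curves cross.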
