Abstract

Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently and depends on several methodological choices. The present study examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given the previous word and informativity given the next word, applying different methods of bigram processing. The results show that the correlations between word length and the corpus-based measures differ across languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
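To make the notion of informativity given the previous word concrete, here is a minimal sketch of how average surprisal could be estimated from bigram counts. It is an illustration under stated assumptions, not the procedure used in the study: the function name `informativity_given_previous`, the unsmoothed maximum-likelihood estimate of P(word | previous word), and the toy sentence are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def informativity_given_previous(tokens):
    """Average surprisal (in bits) of each word given the word before it.

    P(w | c) is estimated as count(c, w) / count(c as a left context),
    with no smoothing (an assumption made for this sketch).
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    # How often each word serves as the left context of some bigram.
    context_totals = Counter(tokens[:-1])

    surprisal_sum = defaultdict(float)
    occurrences = Counter()
    for (prev, word), n in bigrams.items():
        p = n / context_totals[prev]
        surprisal_sum[word] += n * -math.log2(p)
        occurrences[word] += n

    # Average over all occurrences of the word that have a left context.
    return {w: surprisal_sum[w] / occurrences[w] for w in occurrences}


# Toy input, not data from the study.
tokens = "the cat saw the dog and the cat ran".split()
info = informativity_given_previous(tokens)
for word in sorted(info):
    print(word, len(word), round(info[word], 3))
```

Under the same assumptions, calling the function on the reversed token list would give informativity given the next word, since each word is then conditioned on the word that follows it in the original order.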

Highlights

  • One of the most famous generalizations in linguistics is Zipf’s law of abbreviation [1,2]. It states that more frequent words tend to be shorter than less frequent words.

  • In Indonesian, informativity given the previous word consistently has a stronger correlation with word length than frequency does

  • In Russian, the small advantage of previous-word informativity disappears in the cleaned data, such that the difference between correlation coefficients based on informativity and frequency is no longer statistically significant


Summary

Introduction

One of the most famous generalizations in linguistics is Zipf’s law of abbreviation [1,2]. It states that more frequent words tend to be shorter than less frequent words. For example, highly frequent words like it, go and nice are shorter than entity, locomote and agreeable. This law has been tested in corpora of 986 languages from 80 different families, and a negative correlation between frequency and length was observed in all of them, demonstrating that the law of abbreviation is an exceptionless language universal [3]. The negative correlation between frequency and length is likely to be a result of a mostly unconscious process of shortening a linguistic form when its meaning is highly accessible [5]. This is a manifestation of communicative efficiency [6,7]. Experimental evidence shows that learners of an artificial language optimize form-meaning mappings, choosing shorter forms for more frequent meanings under pressure for accuracy and pressure for saving time and effort [8].
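The law of abbreviation can be checked on any tokenized text by correlating word frequency with word length. The sketch below is only an illustration: the choice of a Spearman rank correlation, the helper names (`average_ranks`, `pearson`, `spearman`) and the toy sentence built from the example words above are assumptions, not details taken from the studies cited.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy text (hypothetical), with length measured in characters.
tokens = ("it is nice to go and it is agreeable to locomote "
          "it is nice to go").split()
freq = Counter(tokens)
words = sorted(freq)
rho = spearman([freq[w] for w in words], [len(w) for w in words])
print(f"Spearman correlation between frequency and length: {rho:.3f}")
```

A negative value of `rho` means that the more frequent word types in the sample tend to be shorter, which is the pattern the law of abbreviation predicts.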

