Abstract
Languages employ different strategies to transmit structural and grammatical information. While, for example, grammatical dependency relationships in sentences are mainly conveyed by the ordering of the words in languages like Mandarin Chinese or Vietnamese, the word ordering is much less restricted in languages such as Inupiatun or Quechua, as these languages (also) use the internal structure of words (e.g. inflectional morphology) to mark grammatical relationships in a sentence. Based on a quantitative analysis of more than 1,500 unique translations of different books of the Bible in almost 1,200 different languages that are spoken as a native language by approximately 6 billion people (more than 80% of the world population), we present large-scale evidence for a statistical trade-off between the amount of information conveyed by the ordering of words and the amount of information conveyed by internal word structure: languages that rely more strongly on word order information tend to rely less on word structure information and vice versa. Put differently, if less information is carried within the word, more information has to be spread among words in order to communicate successfully. In addition, we find that, despite differences in the way information is expressed, there is also evidence for a trade-off between different books of the biblical canon that recurs with little variation across languages: the more informative the word order of the book, the less informative its word structure and vice versa. We argue that this might suggest that, on the one hand, languages encode information in very different (but efficient) ways, while, on the other hand, content-related and stylistic features are statistically encoded in very similar ways.
Highlights
Natural languages employ different strategies to transmit the information that is necessary to recover specific aspects of the corresponding message
For example, while grammatical information ("who did what to whom") in a sentence is mainly conveyed by the ordering of the words in languages like Mandarin Chinese or Vietnamese, the word ordering is much less restricted for languages like Inupiatun or Quechua, as these languages (also) use the internal structure of words as cues to inform about grammatical relationships in a sentence
The statistical trade-off between word order and word structure
Summary
Natural languages employ different strategies to transmit the information that is necessary to recover specific aspects of the corresponding message (e.g. grammatical relations, thematic roles, agreement, and more generally, the encoding of grammatical categories).
Our compression scheme works because a few words are repeated very often throughout the text. Those words receive shorter “codes” (in this case, smaller integers). (NB: A standard off-the-shelf file compressor is able to compress the original string to less than 25% of its original size by using more sophisticated compression methods than the one described above.)
Apart from providing insights into the cognitive organization of natural languages, this demonstrates that statistical information plays a vital role at many different levels of linguistic structure. These different aspects of statistical information can be used to compress natural language data.
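The idea that frequent words receive smaller integers can be illustrated with a minimal sketch. This is not the authors' exact scheme, which is not reproduced in this summary; the function names (`build_codebook`, `encode`, `decode`), the alphabetical tie-breaking rule, and the sample text are assumptions made purely for illustration.

```python
from collections import Counter


def build_codebook(words):
    """Assign each word type an integer code; more frequent words get smaller integers."""
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically (an assumption,
    # chosen only so that the codebook is deterministic).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: code for code, word in enumerate(ranked, start=1)}


def encode(words, codebook):
    """Replace every word token with its integer code."""
    return [codebook[w] for w in words]


def decode(codes, codebook):
    """Invert the codebook to recover the original word sequence."""
    inverse = {code: word for word, code in codebook.items()}
    return [inverse[c] for c in codes]


# Hypothetical sample text, lowercased and stripped of punctuation.
text = "in the beginning god created the heaven and the earth and the earth was without form"
words = text.split()

codebook = build_codebook(words)
codes = encode(words, codebook)

print(codebook["the"])  # the most frequent word gets the smallest code
print(codes)
```

In this sketch the codebook has to be stored alongside the integer sequence for decoding to work, so any saving only materializes on texts long enough that frequent words recur many times. Off-the-shelf compressors go further than this by also exploiting regularities below and above the word level.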