Abstract

The recent dramatic increase in online data availability has allowed researchers to explore human culture in unprecedented detail, including the growth and diversification of language. In particular, these data make it possible to test statistically whether word use is similar across languages and, if so, whether such generic features appear at different scales of language structure. Here we use the Google Books $N$-grams dataset to analyze the temporal evolution of word usage in several languages. We apply recently proposed measures of rank dynamics: the diversity of $N$-grams in a given rank, the probability that an $N$-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Across these measures, our results show that different languages share generic properties at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, and conclude that $N$-gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations.
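To make two of the rank-dynamics measures named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the rank diversity at rank k is the number of distinct N-grams observed at that rank, divided by the number of time slices, and that the change probability at rank k is the fraction of successive time slices in which the occupant of rank k differs. The function names and the toy rankings are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def rank_diversity(rankings: Sequence[Sequence[str]], k: int) -> float:
    """Distinct items seen at rank k (1-indexed) over all time slices,
    divided by the number of time slices (assumed normalization)."""
    occupants = {ranking[k - 1] for ranking in rankings}
    return len(occupants) / len(rankings)


def change_probability(rankings: Sequence[Sequence[str]], k: int) -> float:
    """Fraction of successive time-slice pairs in which the item at
    rank k (1-indexed) is replaced by a different item."""
    pairs = list(zip(rankings, rankings[1:]))
    changes = sum(1 for prev, cur in pairs if prev[k - 1] != cur[k - 1])
    return changes / len(pairs)


# Hypothetical toy data: one ranked list of 1-grams per time slice,
# most frequent item first. Not taken from the Google Books dataset.
toy_rankings = [
    ["the", "of", "and", "king"],
    ["the", "of", "and", "queen"],
    ["the", "and", "of", "war"],
    ["the", "of", "and", "peace"],
]

for k in range(1, 5):
    d = rank_diversity(toy_rankings, k)
    p = change_probability(toy_rankings, k)
    print(f"rank {k}: diversity={d:.2f}, change probability={p:.2f}")
```

In this toy example the top rank never changes occupant, while the last rank holds a different word in every time slice, so both measures grow with rank.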

Highlights

  • The recent availability of large datasets on language, music, and other cultural constructs has allowed the study of human culture at a level never possible before, opening the data-driven field of culturomics [1,2,3,4,5,6,7,8,9,10,11,12,13]

  • The rank diversity curves behave similarly for all languages: N-grams in low ranks (the most frequent ones) change position less often than N-grams in higher ranks, yielding a sigmoid rank diversity d(k) (Figure 2)

  • Our statistical analysis suggests that human language is an example of a cultural construct where macroscopic statistics cannot be deduced from microscopic statistics (1-grams)

Summary

Introduction

The recent availability of large datasets on language, music, and other cultural constructs has allowed the study of human culture at a level never possible before, opening the data-driven field of culturomics [1,2,3,4,5,6,7,8,9,10,11,12,13]. Digitized data and computational algorithms allow us to tackle these problems with a stronger statistical basis [14]. From the 2012 update of the public Google Books N-grams dataset, we measure the yearly frequencies of words (1-grams), pairs of words (2-grams), and so on up to N-grams with N = 5 for several languages, and focus on how scale (as measured by N) determines the statistical and temporal characteristics of language structure.
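As an illustration of what yearly N-gram frequencies and ranks look like for N = 1 to 5, here is a small sketch on a toy corpus. The Google Books dataset already provides such counts, so the tokenization, the `toy_corpus` dictionary, and the alphabetical tie-breaking rule below are assumptions made purely for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every run of n consecutive tokens (sliding window)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rank_by_frequency(counts: Counter) -> List[NGram]:
    """Order N-grams from most to least frequent.
    Ties are broken alphabetically (an arbitrary, assumed convention)."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [ngram for ngram, _ in ordered]


# Hypothetical toy corpus: year -> token list.
toy_corpus: Dict[int, List[str]] = {
    1900: "the cat sat on the mat the cat ran".split(),
    1901: "the dog sat on the mat the dog ran".split(),
}

for year, tokens in sorted(toy_corpus.items()):
    for n in range(1, 6):  # N = 1, ..., 5, as in the text
        ranking = rank_by_frequency(ngram_counts(tokens, n))
        print(year, n, " ".join(ranking[0]))
```

Feeding the per-year rankings produced this way into a rank-dynamics measure is then a matter of collecting one ranked list per year for each N.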

