Rank Dynamics of Word Usage at Multiple Scales

José A Morales,Sergio Sánchez,Carlos Pineda,Jorge Flores,Germinal Cocho,Carlos Gershenson,Gerardo Iñiguez,Fernanda Sánchez-Puig,Ewan Colman

doi:10.3389/fphy.2018.00045

Abstract

The recent dramatic increase in online data availability has allowed researchers to explore human culture, such as the growth and diversification of language, in unprecedented detail. In particular, it provides statistical tools to explore whether word use is similar across languages and, if so, whether these generic features appear at different scales of language structure. Here we use the Google Books $N$-grams dataset to analyze the temporal evolution of word usage in several languages. We apply recently proposed measures of rank dynamics: the diversity of $N$-grams in a given rank, the probability that an $N$-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Across these measures, our results show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, concluding that $N$-gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations.
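
As a rough illustration of two of the measures named above, the sketch below computes a rank diversity d(k) and a rank change probability p(k) from a sequence of yearly rankings. It assumes that d(k) is the number of distinct $N$-grams observed at rank k divided by the number of time slices, and that p(k) is the fraction of successive time pairs in which the occupant of rank k differs. These definitions, the function names, and the toy rankings are assumptions made for illustration and are not taken from the paper's code.

```python
from typing import List, Sequence


def rank_diversity(rankings: Sequence[Sequence[str]], max_rank: int) -> List[float]:
    """d(k): distinct items seen at rank k, divided by the number of time slices.

    Assumes every ranking lists at least `max_rank` items.
    """
    n_slices = len(rankings)
    return [len({r[k] for r in rankings}) / n_slices for k in range(max_rank)]


def change_probability(rankings: Sequence[Sequence[str]], max_rank: int) -> List[float]:
    """p(k): fraction of successive time pairs where the item at rank k differs."""
    n_pairs = len(rankings) - 1
    result = []
    for k in range(max_rank):
        changes = sum(1 for a, b in zip(rankings, rankings[1:]) if a[k] != b[k])
        result.append(changes / n_pairs)
    return result


# Hypothetical toy data: one ranking (most frequent first) per time slice.
rankings = [
    ["the", "of", "and", "to"],
    ["the", "of", "to", "and"],
    ["the", "and", "of", "to"],
    ["the", "of", "and", "in"],
]

d = rank_diversity(rankings, max_rank=4)
p = change_probability(rankings, max_rank=4)
print(d)
print(p)
```

In this toy input the top rank never changes occupant, so it has the lowest diversity and a change probability of zero, while deeper ranks turn over more often.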

Highlights

The recent availability of large datasets on language, music, and other cultural constructs has allowed the study of human culture at a level never possible before, opening the data-driven field of culturomics [1,2,3,4,5,6,7,8,9,10,11,12,13].
The behavior of the rank diversity curves is similar for all languages: N-grams in low ranks change their position less than N-grams in higher ranks, yielding a sigmoid rank diversity d(k) (Figure 2).
Our statistical analysis suggests that human language is an example of a cultural construct where macroscopic statistics cannot be deduced from microscopic statistics (1-grams).

Summary

Introduction

The recent availability of large datasets on language, music, and other cultural constructs has allowed the study of human culture at a level never possible before, opening the data-driven field of culturomics [1,2,3,4,5,6,7,8,9,10,11,12,13]. Digitized data and computational algorithms allow such questions to be tackled with a stronger statistical basis [14]. From the 2012 update of the public Google Books N-grams dataset, we measure the frequencies per year of words (1-grams), pairs of words (2-grams), and so on up to N-grams with N = 5, for several languages. We focus on how scale (as measured by N) determines the statistical and temporal characteristics of language structure.
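
To make the notion of scale concrete, the following minimal sketch counts N-grams for N = 1 to 5 in a tokenized text and orders them by frequency, which is the kind of ranking the measures above operate on. The Google Books dataset already ships precomputed yearly counts, so this is only a toy stand-in: the sample sentence, the whitespace tokenization, and the helper names `ngram_counts` and `ranked` are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every run of n consecutive tokens (empty if the text is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ranked(counts: Counter) -> List[NGram]:
    """Order N-grams by decreasing count, breaking ties alphabetically."""
    return [gram for gram, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical one-sentence "corpus" standing in for one year of text.
tokens = "the cat sat on the mat and the cat ran".split()

for n in range(1, 6):
    counts = ngram_counts(tokens, n)
    top = ranked(counts)[0]
    print(n, " ".join(top), counts[top])
```

Repeating this for each year would give one ranking per time slice and per N, so the same rank measures can be compared across scales.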

Journal: Frontiers in Physics
Publication Date: May 22, 2018
Citations: 13
License type: CC BY 4.0

R Discovery Prime

R Discovery Prime

Rank Dynamics of Word Usage at Multiple Scales

Abstract

Highlights

Summary

Talk to us

Similar Papers

More From: Frontiers in Physics

Lead the way for us

Similar Papers

Some features of language use in Yoruba traditional medicine
Wale Adegbite
African Languages and Cultures | VOL. 6
Wale AdegbiteWale Adegbite
01 Jan 1992
African Languages and Cultures | VOL. 6

Ciao, professoressa! A Study of Forms of Address in Italian and Its Implications for the Language Classroom
Diane Musumeci
Italica | VOL. 68
Diane MusumeciDiane Musumeci
01 Jan 1991
Italica | VOL. 68

Assessing Diverse Students With Autism Spectrum Disorders
Tina Taylor Dyches
The ASHA Leader | VOL. 16
Tina Taylor DychesTina Taylor Dyches
01 Jan 2010
The ASHA Leader | VOL. 16

Disciplinary differences in the use of English in higher education: reflections on recent language policy developments
Maria Kuteeva ... John Airey
Higher Education | VOL. 67
Maria Kuteeva, et. al.Maria Kuteeva ... John Airey
04 Sep 2013
Higher Education | VOL. 67

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Rank Dynamics of Word Usage at Multiple Scales

Abstract

Highlights

Summary

Talk to us

Similar Papers

More From: Frontiers in Physics