Abstract
Large-scale analysis and statistics of socio-technical systems that just a few short years ago would have required considerable economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data "proxies" of social life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allow us to investigate different indicators such as the linguistic homogeneity of different countries, the seasonal patterns of tourism within countries and the geographical distribution of different languages in multilingual regions. This work highlights the potential of geolocalized studies of open data sources to improve current analysis and develop indicators for major social phenomena in specific communities.
Highlights
Modern life, with its increasing reliance on digital technologies, is opening unanticipated opportunities for the study of human behavior and large scale societal trends
At the same time, it is crucial to understand to what extent the picture of socio-technical systems emerging from digital data proxies is statistically sound and how well it scales to a planetary dimension [15]
We investigate several relevant examples in language geography and explore the temporal dimension for seasonal patterns
Summary
Modern life, with its increasing reliance on digital technologies, is opening unanticipated opportunities for the study of human behavior and large-scale societal trends. Mobile clients for microblogging platforms, social networking tools, and other "proxy" data of human activity collected on the web allow for the quantitative analysis of social systems at a scale that would have been unimaginable just a few years ago [3,4,5,6]. We perform a comprehensive survey of the worldwide linguistic landscape as it emerges from mining the Twitter microblogging platform. Our large-scale dataset, gathered over approximately two years at an average rate of 6.5 × 10^5 GPS-tagged tweets per day, contains information about almost 6 million users and provides a uniquely fine-grained survey of worldwide linguistic trends. By coupling the geographical layer to the identification of the language of single tweets, we are able to determine the detailed language geography of more than 100 countries worldwide [16].
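The coupling of a geographical layer with per-tweet language identification can be illustrated with a minimal sketch. It assumes each tweet has already been reduced to a (country code, language code) pair; the function names `language_shares` and `dominant_language` and the sample records are hypothetical, chosen only for illustration, and are not the pipeline used in the paper.

```python
from collections import Counter, defaultdict


def language_shares(tweets):
    """Return, for each country, the fraction of tweets written in each language.

    `tweets` is an iterable of (country_code, language_code) pairs.
    Both fields are assumed to have been assigned upstream (geolocation
    and language detection are outside the scope of this sketch).
    """
    counts = defaultdict(Counter)
    for country, lang in tweets:
        counts[country][lang] += 1

    shares = {}
    for country, langs in counts.items():
        total = sum(langs.values())
        shares[country] = {lang: n / total for lang, n in langs.items()}
    return shares


def dominant_language(shares):
    """Pick the language with the largest share in each country."""
    return {country: max(s, key=s.get) for country, s in shares.items()}


# Hypothetical sample records, not real data.
sample = [
    ("BE", "nl"), ("BE", "nl"), ("BE", "fr"), ("BE", "en"),
    ("ES", "es"), ("ES", "es"), ("ES", "ca"), ("ES", "es"),
]

shares = language_shares(sample)
top = dominant_language(shares)
```

The same grouping could be applied with a finer geographical key (a city or a neighborhood identifier instead of a country code) to study multilingual regions at higher resolution.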