Abstract

Recently, numerous approaches have emerged in the social sciences to exploit the opportunities made possible by the vast amounts of data generated by online social networks (OSNs). Having access to information about users on such a scale opens up a range of possibilities, from predicting individuals’ demographics and health status to their beliefs and political opinions, all without the limitations associated with often slow and expensive paper-based polls. A question that remains to be satisfactorily addressed, however, is how demography is represented in OSN content: that is, what are the relevant aspects that constitute detectable large-scale patterns in language? Here, we study language use in the United States using a corpus of text compiled from over half a billion geotagged messages from the online microblogging platform Twitter. Our intention is to reveal the most important spatial patterns in language use in an unsupervised manner and relate them to demographics. Our approach is based on Latent Semantic Analysis augmented with the Robust Principal Component Analysis methodology, which permits identification of the data’s main sources of variation with an automatic filtering of noise and outliers without influencing results by a priori assumptions. We find spatially correlated patterns that can be interpreted based on the words associated with them. The main language features can be related to slang use, urbanization, travel, religion and ethnicity, the patterns of which are shown to correlate plausibly with traditional census data. Apart from the standard measure of linear correlation, some relations seem to be better explained by Boolean implications, suggesting a threshold-like behaviour where demographic variables influence the users’ word use.
Our findings validate the concept of demography being represented in OSN language use and show that the traits observed are inherently present in the word frequencies without any previous assumptions about the dataset. They therefore could form the basis of further research focusing on the evaluation of demographic data estimation from other big data sources, or on the dynamical processes that result in the patterns identified here.
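To make the combination of Latent Semantic Analysis and Robust Principal Component Analysis more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a regions-by-words frequency matrix `M` and splits it into a low-rank part `L` (the main sources of variation) and a sparse part `S` (outliers) using principal component pursuit solved with a standard augmented Lagrangian iteration. The solver, the default parameters `lam` and `mu`, the function names and the synthetic data are all assumptions chosen for illustration; the preprocessing and weighting used in the study are not described in this text.

```python
import numpy as np

def soft_threshold(X, tau):
    """Shrink every entry of X toward zero by tau (proximal map of the L1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Soft-threshold the singular values of X (proximal map of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into low-rank L and sparse S by minimising ||L||_* + lam * ||S||_1
    subject to L + S = M (principal component pursuit)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    # Commonly used defaults; assumptions here, not values taken from the paper.
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = m * n / (4.0 * np.abs(M).sum()) if mu is None else mu
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multipliers for the constraint L + S = M
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S

# Synthetic stand-in for a regions-by-words matrix: two latent patterns plus
# a few large outliers (illustrative data only, not Twitter word frequencies).
rng = np.random.default_rng(0)
n_regions, n_words, rank = 80, 60, 2
low_rank_true = rng.normal(size=(n_regions, rank)) @ rng.normal(size=(rank, n_words))
outlier_mask = rng.random((n_regions, n_words)) < 0.05
outliers = np.zeros((n_regions, n_words))
outliers[outlier_mask] = rng.choice([-10.0, 10.0], size=outlier_mask.sum())
M = low_rank_true + outliers

L, S = robust_pca(M)

# LSA-style step: the SVD of the cleaned matrix gives one score per region
# (a spatial pattern) and one loading per word for each leading component.
U, s, Vt = np.linalg.svd(L, full_matrices=False)
region_scores = U[:, :rank] * s[:rank]
word_loadings = Vt[:rank]
```

In this sketch, each row of `word_loadings` would be inspected for its highest-weighted words to interpret a component, and the matching column of `region_scores` could be mapped or compared against census variables.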

Highlights

  • Geography plays an important role in many social phenomena: clearly, many aspects of life are influenced by the possibilities offered by the environment in which one lives (Quillian, 1999; Brain, 2005; Bruch and Mare, 2006; Iceland and Wilkes, 2006; Bettencourt et al, 2007; Sampson, 2009)

  • Two common data sources are mobile phone networks, where user activity and aggregated measures of network utilization are recorded at the antenna level as part of regular operation (Blondel et al, 2015), and online social networks (OSNs) (Mislove, 2009), where the content publicly shared by users in many cases includes their position (Cheng et al, 2011)

  • In this study our goal is to analyse in an unsupervised manner how and to what extent regional-scale demographic attributes are represented in social media posts


Summary

Introduction

Geography plays an important role in many social phenomena: clearly, many aspects of life are influenced by the possibilities offered by the environment in which one lives (Quillian, 1999; Brain, 2005; Bruch and Mare, 2006; Iceland and Wilkes, 2006; Bettencourt et al, 2007; Sampson, 2009). In the past two decades, there has been significant growth in the amount of data collected about individuals that has been made available for research purposes. This has had a large impact on social science research, where empirical studies were previously limited by the cost and effort associated with data collection. Such research includes studies focusing on how modern data collection methods can be used to reveal the spatial structure in society on several scales, and how quantities measured in the online or abstract environments are connected to real-world phenomena. Some other data sources with promising application possibilities include monetary transactions (Brockmann et al, 2006; Thiemann et al, 2010; Sobolevsky et al, 2016), GPS traces from cars (Pappalardo et al, 2013, 2015) and other devices, and public transportation usage as recorded by electronic payment systems (Roth et al, 2011; Hasan et al, 2013).

