Abstract

Text is one of the most prevalent types of digital data that people create as they go about their lives. Digital footprints of people's language use in social media posts have been shown to allow inferences about their age and gender. However, the even more prevalent and potentially more sensitive text from instant messaging services has remained largely uninvestigated. We analyze language variations in instant messages with regard to individual differences in age and gender by replicating and extending the methods used in prior research on social media posts. Using a dataset of 309,229 WhatsApp messages from 226 volunteers, we identify unique age- and gender-linked language variations. We use cross-validated machine learning algorithms to predict volunteers' age (MAE_Md = 3.95, r_Md = 0.81, R²_Md = 0.49) and gender (Accuracy_Md = 85.7%, F1_Md = 0.67, AUC_Md = 0.82) significantly above baseline levels and identify the most predictive language features. We discuss implications for psycholinguistic theory, present opportunities for application in author profiling, and suggest methodological approaches for making predictions from small text data sets. Given the recent trend towards the dominant use of private messaging and increasingly weak user data protection, we highlight rising threats to individual privacy rights in instant messaging.
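As an illustration of how the regression metrics reported above relate to one another, the following is a minimal sketch in plain Python. It assumes that the "Md" subscript denotes the median across cross-validation folds; the function names and the toy age values are hypothetical and are not taken from the study. It also shows why a high correlation r can coexist with a lower R²: predictions that are shrunk towards the mean still correlate perfectly with the true values but leave residual error.

```python
from statistics import mean, median


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted values."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / (var_t * var_p) ** 0.5


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def median_over_folds(folds):
    """Summarize per-fold metrics by their median.

    folds: list of (y_true, y_pred) pairs, one pair per held-out fold.
    """
    return {
        "MAE_Md": median([mae(t, p) for t, p in folds]),
        "r_Md": median([pearson_r(t, p) for t, p in folds]),
        "R2_Md": median([r_squared(t, p) for t, p in folds]),
    }


# Hypothetical toy data: true ages and predicted ages for three folds.
folds = [
    ([20, 30, 40], [25, 30, 35]),  # predictions shrunk towards the mean
    ([20, 30, 40], [20, 30, 40]),  # perfect predictions
    ([20, 30, 40], [22, 30, 38]),  # mildly shrunk predictions
]
summary = median_over_folds(folds)
print(summary)
```

In the first toy fold the predictions are perfectly correlated with the true ages, yet R² falls below 1 because each prediction at the extremes is off by five years. This is one way a reported r can sit well above the corresponding R².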

Highlights

  • When texting a friend on WhatsApp, posting on Facebook, tweeting on Twitter, or writing a blog post, we inevitably leave behind digital footprints in the form of text data

  • In a similar manner to prior studies based on social media posts, this work aims to create insights into age- and gender-linked linguistic variations and explore how accurately information on user demographics can be inferred from instant messages

  • We found a range of age- and gender-linked language variations in our data


Introduction

When texting a friend on WhatsApp, posting on Facebook, tweeting on Twitter, or writing a blog post, we inevitably leave behind digital footprints in the form of text data. Research in the domain of author profiling has shown that language characteristics of Facebook status updates (Jaidka et al., 2018; Sap et al., 2014; Schwartz et al., 2013), tweets (Bamman et al., 2014; Burger et al., 2011; Jaidka et al., 2018; Rao et al., 2010; Sap et al., 2014), and blog posts (Argamon et al., 2007; Sap et al., 2014; Schler et al., 2006) allow for the accurate inference of the authors' age and gender. These social media studies extended the theory of gender- and age-linked language variations (Park et al., 2016). In a similar manner to prior studies based on social media posts, this work aims to gain insights into age- and gender-linked linguistic variations and to explore how accurately information on user demographics can be inferred from instant messages.
