Abstract

Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community’s unique word types, is used to identify cases where a social group’s language deviates from the norm. We validate our metrics using user-created glossaries and draw on sociolinguistic theories to connect language variation with trends in community behavior. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.

Highlights

  • Internet language is often popularly characterized as a messy variant of “standard” language (Desta, 2014; Magalhães, 2019)

  • A line of future work suggested by Del Tredici and Fernández (2017) is extending studies on semantic variation to a larger set of communities, which our present work aims to achieve

  • We develop and evaluate word sense induction models using SemEval WSI tasks in a manner that is designed to parallel their later use on larger Reddit data


Summary

Introduction

Internet language is often popularly characterized as a messy variant of “standard” language (Desta, 2014; Magalhães, 2019). Work in sociolinguistics, however, has demonstrated that online language is not homogeneous (Herring and Paolillo, 2006; Nguyen et al., 2016; Eisenstein, 2013). Instead, it exhibits a great deal of variation, often driven by social variables. A word such as python in Figure 1 has different usages depending on the community in which it is used. Our work examines both lexical and semantic variation, and operationalizes the study of the latter using BERT (Devlin et al., 2019).
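
To make the notion of a word type's "specificity" to a community more concrete, the sketch below scores words with normalized pointwise mutual information (NPMI) between a word and a community. This is an assumption for illustration only: the text above does not say which specificity measure is used, and the function name `npmi_specificity` and the toy counts are hypothetical.

```python
import math
from collections import Counter


def npmi_specificity(community_counts, all_counts):
    """Score each word in a community by NPMI with that community.

    Illustrative assumption, not necessarily the measure used in the paper.
    `all_counts` must include the community's own tokens.
    Scores lie in [-1, 1]; higher means more community-specific.
    """
    n_comm = sum(community_counts.values())
    n_all = sum(all_counts.values())
    scores = {}
    for word, c_wc in community_counts.items():
        p_w_given_c = c_wc / n_comm          # P(word | community)
        p_w = all_counts[word] / n_all       # P(word) over all communities
        p_wc = c_wc / n_all                  # joint P(word, community)
        if p_wc >= 1.0:                      # degenerate one-word corpus
            scores[word] = 1.0
            continue
        pmi = math.log(p_w_given_c / p_w)
        scores[word] = pmi / -math.log(p_wc)
    return scores


# Hypothetical toy counts: one programming community versus all communities.
community = Counter({"python": 8, "code": 2})
everything = Counter({"python": 10, "code": 10, "snake": 20})
scores = npmi_specificity(community, everything)
```

With these toy counts, "python" receives a positive score because it is over-represented in the community, while "code" receives a negative one because it is slightly under-represented. The same kind of score could in principle be computed over sense clusters instead of word types, given cluster assignments for each token.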

