Abstract

Scaling properties of language are a useful tool for understanding generative processes in texts. We investigate the scaling relations in citywise Twitter corpora coming from the metropolitan and micropolitan statistical areas of the United States. We observe a slightly superlinear urban scaling with the city population for the total volume of the tweets and words created in a city. We then find that a certain core vocabulary follows the same scaling relationship as the bulk text, but most words are sensitive to city size, exhibiting super- or sublinear urban scaling. For both regimes, we can offer a plausible explanation based on the meaning of the words. We also show that the parameters of Zipf’s Law and Heaps’ Law on Twitter differ from those of other texts, and that the exponent of Zipf’s Law changes with city size.
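As an illustration of what "superlinear urban scaling" means, urban scaling is commonly written as a power law Y = Y0 · N^β, where N is the city population and Y is a city-level quantity such as tweet volume; β > 1 is superlinear and β < 1 is sublinear. The sketch below estimates β by ordinary least squares on log-transformed values. The fitting procedure, the function name `fit_scaling_exponent`, and the synthetic data are assumptions made for illustration; the abstract does not state how the exponents were actually fitted.

```python
import math


def fit_scaling_exponent(populations, volumes):
    """Least-squares fit of log(Y) = log(Y0) + beta * log(N).

    Hypothetical helper for illustration only; returns (beta, y0).
    """
    xs = [math.log(n) for n in populations]
    ys = [math.log(y) for y in volumes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope in log-log space
    y0 = math.exp(mean_y - beta * mean_x)  # prefactor from the intercept
    return beta, y0


# Synthetic, noise-free data generated with beta = 1.1 (not real Twitter data).
populations = [1e4, 5e4, 2e5, 1e6, 5e6]
volumes = [3.0 * n ** 1.1 for n in populations]

beta, y0 = fit_scaling_exponent(populations, volumes)
regime = "superlinear" if beta > 1 else "sublinear or linear"
print(f"beta = {beta:.3f}, y0 = {y0:.3f}, regime = {regime}")
```

On real data the points scatter around the fitted line, so the estimate of β should be reported together with an uncertainty measure before a regime is assigned.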

Highlights

  • The recent increase in digitally available language corpora has made it possible to extend traditional linguistic tools to a vast amount of often user-generated texts

  • We show that the number of new words needed in longer texts (Heaps’ Law [2]) exhibits a sublinear power-law form on Twitter, indicating a decelerating growth of distinct tokens with city size

  • Because most observations so far hold only for books or corpora that contain longer texts than tweets, our results suggest that the nature of communication, in our case Twitter itself, affects the parameters of linguistic laws
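The Heaps' Law highlight above can be made concrete with a small sketch. Heaps' Law states that the number of distinct words V grows as a sublinear power of the text length n, V(n) ≈ K · n^β with β < 1. The code below counts vocabulary growth token by token and estimates β with a log-log least-squares fit. The function names, the toy token sequence, and the choice of measuring growth by token position (rather than by city size, as in the study) are assumptions made for illustration.

```python
import math


def vocabulary_growth(tokens):
    """Return V(n): the number of distinct tokens among the first n tokens."""
    seen, growth = set(), []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def heaps_exponent(growth):
    """Estimate beta in V(n) ~ K * n**beta by least squares in log-log space."""
    xs = [math.log(n) for n in range(1, len(growth) + 1)]
    ys = [math.log(v) for v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy token stream with repeated words (not real tweet data).
toy_tokens = "a b a c b d a e".split()
toy_growth = vocabulary_growth(toy_tokens)
toy_beta = heaps_exponent(toy_growth)
print(toy_growth, round(toy_beta, 3))
```

A stream in which every token is new gives β = 1; repeated words pull the estimate below 1, which is the sublinear regime the highlight describes.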


Summary

Introduction

The recent increase in digitally available language corpora has made it possible to extend traditional linguistic tools to a vast amount of often user-generated texts. Understanding how these corpora differ from traditional texts is crucial in developing computational methods for web search, information retrieval or machine translation [1]. The amount of these texts enables the analysis of language on an unprecedented scale [2,3,4], including the dynamics, geography and time scale of language change [5,6], social media cursing habits [7,8,9] or dialectal variations [10]. Various studies have analysed spatial variation in the text of online social network messages and its applicability to several different questions, including user localization based on the content of their posts [17,18], empirical analysis of the geographical diffusion of novel words, phrases, trends and topics of interest [19,20], measuring public mood [21]
