Zipf's and Benford's laws in Twitter hashtags

José Alberto Pérez-Melián, J Alberto Conejero, Cesar Ferri Ramírez

doi:10.18653/v1/e17-4009

Abstract

Social networks have transformed communication dramatically in recent years through the rise of new platforms and the development of a new language of communication. This landscape requires new ways to describe and predict the behaviour of users in networks. This paper presents an analysis of the frequency distribution of hashtag popularity in Twitter conversations. Our objective is to determine whether this frequency distribution follows one of the well-known distributions that many real-life sets of numerical data satisfy. In particular, we study how closely the frequency distribution of hashtag popularity matches Zipf’s law, an empirical law stating that many types of data in the social sciences can be approximated by a Zipfian distribution. We also analyse Benford’s law, a special case of Zipf’s law that describes a common pattern in the frequency distribution of leading digits. To compute the frequency distribution of hashtag popularity correctly, we need to correct the many spelling errors that Twitter users introduce. For this purpose, we introduce a new filter, based on string distances, that corrects hashtag mistakes. Experiments on datasets of Twitter streams generated under controlled conditions show that Benford’s law and Zipf’s law can be used to model the hashtag frequency distribution.
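To make the idea of a string-distance filter concrete, the following is a minimal sketch and not the filter introduced in the paper. It assumes Levenshtein edit distance as the string distance, an illustrative threshold of one edit, and a hypothetical helper named `merge_misspellings` that folds rare spellings into the most frequent nearby hashtag.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def merge_misspellings(counts: Counter, max_distance: int = 1) -> Counter:
    """Fold each hashtag into an already-seen, more frequent hashtag
    that lies within max_distance edits (assumed threshold)."""
    merged = Counter()
    # Visit hashtags from most to least frequent, so that popular
    # spellings are the ones kept as canonical forms.
    for tag, n in counts.most_common():
        target = next(
            (seen for seen in merged if levenshtein(tag, seen) <= max_distance),
            tag,
        )
        merged[target] += n
    return merged


raw = Counter({"#election": 50, "#football": 10, "#electiom": 3, "#elction": 2})
cleaned = merge_misspellings(raw)
```

With this toy input, the two misspellings are absorbed into `#election`, while `#football` is left alone. A threshold this small is an assumption chosen for illustration; short hashtags that differ by one character can be legitimately distinct, so any real threshold would need tuning.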

Highlights

Twitter is a microblogging social network launched in 2006, with 310 million active users per month and 340 million tweets generated daily.
We find examples of this use in electoral processes, accounting fraud detection, and scientific fraud detection.
We confirm that the distribution of hashtag frequency follows a power law, as Zipf’s law predicts.
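As a rough illustration of what "follows a power law" means for ranked frequencies, the sketch below estimates the exponent s in f(r) ∝ r^(-s) by ordinary least squares on log-rank against log-frequency. This is a simple diagnostic under stated assumptions, not the fitting procedure used in the paper; the function name `zipf_exponent` is hypothetical.

```python
import math


def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ r**(-s) from a list of positive counts.

    Counts are sorted in decreasing order, ranked from 1, and a straight
    line is fitted to (log rank, log frequency). The negated slope is
    returned. At least two positive counts are required.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic counts that follow f(r) = 1000 / r exactly, so s should be close to 1.
ideal = [1000 / rank for rank in range(1, 51)]
s = zipf_exponent(ideal)
```

An exponent near 1 corresponds to the classic form of Zipf's law. A least-squares fit on a log-log plot is easy to compute but is known to be a crude estimator for heavy-tailed data, so it is best read as a first check rather than a formal test.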

Summary

Introduction

Twitter is a microblogging social network launched in 2006, with 310 million active users per month and 340 million tweets generated daily. Twitter users can interact with one another by using the @ symbol followed by the username they want to mention. They can classify tweets into more than one category or theme by using hashtags (alphanumeric strings preceded by #). To the best of our knowledge, there are no studies of the frequency distribution of hashtag popularity in Twitter conversations. Our goal is to analyse Twitter datasets to discover whether the frequency of hashtag popularity follows any of the distribution laws that are common in many types of data in the social sciences.
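The kind of analysis described here can be sketched in a few lines. The example below assumes that tweets are available as plain strings, that a hashtag matches the pattern `#\w+`, that hashtags are compared case-insensitively, and that the leading digits of interest are those of the per-hashtag counts. All function names are hypothetical. The expected share of leading digit d under Benford's law is log10(1 + 1/d).

```python
import math
import re
from collections import Counter

# Assumed pattern: '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#\w+")


def hashtag_counts(tweets):
    """Count hashtag occurrences (case-insensitive) over an iterable of tweet texts."""
    return Counter(tag.lower() for text in tweets for tag in HASHTAG.findall(text))


def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 among positive integers."""
    digits = Counter(int(str(v)[0]) for v in values if v > 0)
    total = sum(digits.values())
    return {d: digits[d] / total for d in range(1, 10)}


def benford_shares():
    """Expected share of each leading digit under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


tweets = ["Vote today #Election #vote", "#election results are in", "no tags here"]
counts = hashtag_counts(tweets)
observed = leading_digit_shares([1, 12, 25, 300])
expected = benford_shares()
```

Comparing `leading_digit_shares(counts.values())` with `benford_shares()` on a large sample gives a first impression of how close the counts are to Benford's law. Three toy tweets are far too few for such a comparison to be meaningful.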

Objectives

Methods

Findings

Conclusion