Abstract

Social media are well-established means of online communication, generating vast amounts of data. In this paper, we focus on Twitter and investigate behavioural differences between male and female users on social media. Using Natural Language Processing and Machine Learning approaches, we propose a user gender identification method that considers both the tweets and the Twitter profile description of a user. For experimentation and evaluation, we enriched and used an existing Twitter User Gender Classification dataset, which is freely available on Kaggle. We considered a variety of methods and components, such as the Bag of Words model, pre-trained word embeddings (GLOVE, BERT, GPT2 and Word2Vec) and machine learners, e.g., Naïve Bayes, Support Vector Machines and Random Forests. Evaluation results have shown that including the Twitter profile description of a user significantly improves gender classification accuracy, by 10% approximately. Stanford’s GLOVE embedding model, pre-trained on 2 billion tweets, 27 billion tokens and a vocabulary size of 1.2 million words, achieved the highest gender prediction accuracy, considering both the tweets and the profile description of a user. Statistical significance has been assessed using McNemar’s two-tailed test.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call