Gender prediction with descriptive textual data using a Machine Learning approach

Babatunde Onikoyi, Nonso Nnamoko, Ioannis Korkontzelos

doi:10.1016/j.nlp.2023.100018


Open Access


Journal: Natural Language Processing Journal
Publication Date: Jun 9, 2023
Citations: 4
License type: CC BY

Affiliation: Edge Hill University

Abstract

Social media are well-established means of online communication, generating vast amounts of data. In this paper, we focus on Twitter and investigate behavioural differences between male and female users. Using Natural Language Processing and Machine Learning approaches, we propose a user gender identification method that considers both the tweets and the Twitter profile description of a user. For experimentation and evaluation, we enriched and used an existing Twitter User Gender Classification dataset, which is freely available on Kaggle. We considered a variety of methods and components, such as the Bag of Words model, pre-trained word embeddings (GloVe, BERT, GPT-2 and Word2Vec) and machine learners, e.g., Naïve Bayes, Support Vector Machines and Random Forests. Evaluation results show that including the Twitter profile description of a user significantly improves gender classification accuracy, by approximately 10%. Stanford’s GloVe embedding model, pre-trained on 2 billion tweets (27 billion tokens, with a vocabulary of 1.2 million words), achieved the highest gender prediction accuracy when both the tweets and the profile description of a user were considered. Statistical significance was assessed using McNemar’s two-tailed test.
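To illustrate the significance check mentioned in the abstract, the following is a minimal sketch of a two-tailed McNemar test comparing two classifiers on the same test items. It assumes the exact (binomial) form of the test; the abstract does not state which variant was used. The function name `mcnemar_exact` and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-tailed exact McNemar test on paired predictions (illustrative sketch).

    Returns (b, c, p_value), where
      b = items classifier A gets right and classifier B gets wrong,
      c = items classifier A gets wrong and classifier B gets right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: 1 and 0 stand for the two gender classes.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
tweets_only = [0, 0, 0, 0, 0, 0, 0, 0]          # assumed output of model A
tweets_and_profile = [1, 1, 1, 1, 1, 0, 0, 0]   # assumed output of model B

b, c, p_value = mcnemar_exact(y_true, tweets_only, tweets_and_profile)
print(b, c, p_value)
```

Only the items on which the two models disagree enter the statistic, which is why the test suits comparisons of two classifiers evaluated on the same test set.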

Full Text