Coming to Grips with Age Prediction on Imbalanced Multimodal Community Question Answering Data

Alejandro Figueroa, Billy Peralta, Orietta Nicolis

doi:10.3390/info12020048

Open Access

Journal: Information	Publication Date: Jan 21, 2021
Citations: 15	License type: CC BY 4.0

Affiliation: Universidad Andrés Bello

Abstract

For almost every online service, it is fundamental to understand the patterns, differences and trends revealed by age demographic analysis; consider, for example, the discovery of malicious activity, including identity theft, violations of community guidelines and fake profiles. In the particular case of platforms such as Facebook, Twitter and Yahoo! Answers, user demographics affect both revenue and user experience: demographics help ensure that the needs of each cohort are met by personalizing and contextualizing content. Although technology has become more accessible, and thereby ever more prevalent in both personal and professional life, older people continue to trail Gen Z and Millennials in its adoption. This lag brings about an under-representation that harms both demographic analysis and supervised machine learning models. To that end, this paper pioneers an examination of this and other major challenges facing three distinct modalities on community question answering (cQA) platforms (i.e., texts, images and metadata). As for textual inputs, we propose an age-batched greedy curriculum learning (AGCL) approach to lessen the effects of their inherent class imbalances. When built on top of FastText shallow neural networks, AGCL achieved an increase of ca. 4% in macro-F1-score with respect to baseline systems (i.e., off-the-shelf deep neural networks). With regard to metadata, our experiments show that random forest classifiers significantly improve their performance when individuals close to generational borders are excluded (up to 20% more accuracy). Finally, by experimenting with neural network-based visual classifiers, we discovered that images are the most challenging modality for age prediction. In fact, it is hard to connect profile pictures with age cohorts by visual inspection, and there are considerable differences in their group distributions with respect to metadata and textual inputs. All in all, we envisage that our findings will be highly relevant as guidelines for constructing assorted multimodal supervised models for automatic age recognition across cQA platforms.
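
The abstract reports gains in macro-F1-score, a metric that weights every age cohort equally regardless of its size. The following is a minimal sketch in plain Python of why this matters under class imbalance; it is not the authors' evaluation code. The cohort labels, the counts, and the function names `macro_f1` and `accuracy` are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Classes are taken from the union of true and predicted labels;
    a class with no true positives, false positives or false
    negatives would score 0.0 (an assumption of this sketch).
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily imbalanced toy data: 8 of 10 users fall in one cohort.
y_true = ["millennial"] * 8 + ["gen_x"] + ["boomer"]
# A degenerate classifier that always predicts the majority cohort.
y_pred = ["millennial"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {mf1:.3f}")
```

On this toy data the majority-class predictor looks strong by accuracy, yet its macro-F1 is low because the two minority cohorts each contribute an F1 of zero. This is the kind of under-representation effect that makes macro-averaged metrics a sensible yardstick for imbalanced age prediction.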

Summary

Introduction

There is no question that demographic analysis is essential for running a successful social media network. In essence, this analysis is considered virtually indispensable for engaging members on an individual level and for building social capital. People of different ages have distinct ways of expressing themselves and often spend their time on separate platforms. Consider the case of Millennials, who may spend most of their time on Instagram and Facebook, whereas older people prefer to rely heavily on their email inboxes. This can be found on community question answering (cQA) sites such as Yahoo! Answers. Aside from that, another report mentions that Yahoo! Answers had enrolled about one hundred million members as of December 2015 [2].

Methods

Discussion

Conclusion