Abstract

In this study we propose a novel method to generate Document Embeddings (DEs) by evolving mathematical equations that integrate classical term-frequency statistics. To this end, we employ a Genetic Programming (GP) strategy to build competitive formulae that weight custom Word Embeddings (WEs) produced by state-of-the-art feature extraction techniques (e.g., word2vec, fastText, BERT); DEs are then created by weighted averaging of these WEs. We exhaustively evaluated the proposed method on nine datasets drawn from several multilingual social media sources, with the aim of predicting personal attributes of authors (e.g., gender, age, personality traits) across 17 tasks. On each dataset we contrast the results of our method with those of state-of-the-art competitors; our approach places in the top quartile in all cases. Furthermore, we introduce a new numerical statistic called the Relevance Topic Value (rtv), which can enhance the prediction of author characteristics by numerically describing the topic of a document and each user's personal use of words. Interestingly, a frequency analysis of the terminals used by GP shows that rtv is the feature most likely to appear alone in a single equation, suggesting its usefulness as a stand-alone WE weighting scheme.
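The aggregation step described above, building a DE as the weighted average of a document's WEs, can be sketched as follows. This is a minimal illustration only: the abstract does not give the GP-evolved weighting formula, so the weights here are hypothetical placeholders standing in for whatever term-frequency-based expression GP produces.

```python
import numpy as np

def document_embedding(word_vectors, weights):
    """Build a document embedding as the weighted average of its
    word embeddings.

    word_vectors: (n_words, dim) array of word embeddings
    weights:      (n_words,) per-word weights, e.g. the output of a
                  GP-evolved term-frequency formula (hypothetical here;
                  the exact formula is not given in the abstract).
    """
    w = np.asarray(weights, dtype=float)
    v = np.asarray(word_vectors, dtype=float)
    # Weighted sum over words, normalized by the total weight.
    return (w[:, None] * v).sum(axis=0) / w.sum()

# Toy usage: three 2-D word vectors with simple placeholder weights.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [2.0, 1.0, 1.0]
doc_vec = document_embedding(vectors, weights)
# doc_vec -> [0.75, 0.5]
```

In the proposed method, the interesting part is not the averaging itself but the weights: GP searches the space of equations over term-frequency statistics (and, per the abstract, the rtv feature) to find a weighting scheme that yields discriminative DEs for author profiling.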
