Abstract

In this study we propose a novel method to generate Document Embeddings (DEs) by evolving mathematical equations that integrate classical term-frequency statistics. To this end, we employ a Genetic Programming (GP) strategy to build competitive formulae that weight custom Word Embeddings (WEs) produced by state-of-the-art feature extraction techniques (e.g., word2vec, fastText, BERT); DEs are then created by weighted averaging of these WEs. We exhaustively evaluated the proposed method on nine datasets drawn from several multilingual social media sources, with the aim of predicting personal attributes of authors (e.g., gender, age, personality traits) across 17 tasks. On each dataset we contrast the results of our method with those of state-of-the-art competitors; our approach places in the top quartile in all cases. Furthermore, we introduce a new numerical statistic called the Relevance Topic Value (rtv), which can enhance the prediction of author characteristics by numerically describing the topic of a document and each user's personal use of words. Interestingly, a frequency analysis of the terminals used by GP shows that rtv is the feature most likely to appear alone in a single equation, suggesting its usefulness as a stand-alone WE weighting scheme.
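The aggregation step described above, building a DE as the weighted average of a document's WEs, can be sketched as follows. This is a minimal illustration only: the abstract does not give the GP-evolved weighting formula, so the weights here are hypothetical placeholders standing in for whatever term-frequency-based expression GP produces.

```python
import numpy as np

def document_embedding(word_vectors, weights):
    """Build a document embedding as the weighted average of its
    word embeddings.

    word_vectors: (n_words, dim) array of word embeddings
    weights:      (n_words,) per-word weights, e.g. the output of a
                  GP-evolved term-frequency formula (hypothetical here;
                  the exact formula is not given in the abstract).
    """
    w = np.asarray(weights, dtype=float)
    v = np.asarray(word_vectors, dtype=float)
    # Weighted sum over words, normalized by the total weight.
    return (w[:, None] * v).sum(axis=0) / w.sum()

# Toy usage: three 2-D word vectors with simple placeholder weights.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [2.0, 1.0, 1.0]
doc_vec = document_embedding(vectors, weights)
# doc_vec -> [0.75, 0.5]
```

In the proposed method, the interesting part is not the averaging itself but the weights: GP searches the space of equations over term-frequency statistics (and, per the abstract, the rtv feature) to find a weighting scheme that yields discriminative DEs for author profiling.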
