Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets

Marco Pota, Mirko Ventura, Hamido Fujita, Massimo Esposito

doi:10.1016/j.eswa.2021.115119

Abstract

Social media offer a large amount of information that can be exploited in many fields of research. However, while Natural Language Processing methods achieve good results on well-formed datasets of written text with clear syntax, social media sources contain text written in informal language, with unstructured syntax and peculiar symbols; text processing in this setting therefore requires dedicated approaches. This paper addresses the task of sentiment analysis of tweets. In particular, the pre-processing of tweets is analyzed, with the aim of avoiding the noise introduced by web constructs such as URLs and mentions and by other text fragments, and of exploiting the information hidden in symbols such as emoticons, emojis and hashtags. In more detail, a number of experiments, performed with a state-of-the-art classification model (BERT), are designed to evaluate many currently available operations for pre-processing tweets, in terms of the statistical significance of their influence on sentiment analysis performance. Moreover, available data in two languages, English and Italian, are considered, in order to also evaluate the dependence on the language. The results make it possible to identify the most convenient strategy for pre-processing tweets, and thus to improve the state of the art in both languages for the considered sentiment analysis task.
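To make the kind of operation under evaluation concrete, the following is a minimal sketch of a tweet pre-processing step. It is an illustration only and not necessarily the pipeline evaluated in the paper: the function name `preprocess_tweet`, the placeholder tokens `url` and `user`, the regular expressions, and the tiny emoticon lexicon are all assumptions introduced here.

```python
import re

# Assumed patterns for common web constructs found in tweets.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

# Hypothetical, deliberately tiny emoticon lexicon (illustrative only).
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}


def preprocess_tweet(text: str) -> str:
    """Normalize noisy constructs and expose the meaning carried by symbols."""
    # Replace URLs and mentions with generic placeholder tokens (assumed names).
    text = URL_RE.sub("url", text)
    text = MENTION_RE.sub("user", text)
    # Keep the hashtag word but drop the '#' symbol.
    text = HASHTAG_RE.sub(r"\1", text)
    # Map whitespace-separated emoticons to a descriptive word.
    tokens = [EMOTICONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess_tweet("@anna loved it :) #greatday https://t.co/abc"))
```

Each operation in this sketch (URL replacement, mention replacement, hashtag handling, emoticon translation) can be switched on or off independently, which is the kind of choice whose effect on classification performance can then be compared.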

Full Text