Politically-oriented information inference from text

Samuel Caetano Da Silva,Ivandre Paraboni

doi:10.3897/jucs.96652

Abstract

The inference of politically-oriented information from text data is a popular research topic in Natural Language Processing (NLP) at both text- and author-level. In recent years, studies of this kind have been implemented with the aid of text representations ranging from simple count-based models (e.g., bag-of-words) to sequence-based models built from transformers (e.g., BERT). Despite considerable success, however, we may still ask whether results may be improved further by combining these models with additional text representations. To shed light on this issue, the present work describes a series of experiments to compare a number of strategies for political bias and ideology inference from text data using sequence-based BERT models, syntax-and semantics-driven features, and examines which of these representations (or their combinations) improve overall model accuracy. Results suggest that one particular strategy - namely, the combination of BERT language models with syntactic dependencies - significantly outperforms well-known count- and sequence-based text classifiers alike. In particular, the combined model has been found to improve accuracy across all tasks under consideration, outperforming the SemEval hyperpartisan news detection top-performing system by up to 6%, and outperforming the use of BERT alone by up to 21%, making a potentially strong case for the use of heterogeneous text representations in the present tasks.

Full Text