Abstract

Despite recent advances in text classification and the performance improvements yielded by Transformer models, the absence or inaccessibility of an adequate dataset to train a text classifier motivates the search for alternative routes. In this study, the need to detect specific topics in the news and to discard irrelevant content encouraged the development of an article-tagging pipeline that assesses the similarity between a user-defined dictionary of topic-specific keywords and keywords extracted from news articles. The novelty of the paper lies in the use of two BERT-based algorithms to retrieve article keywords and to embed them, which previous studies have shown to outperform state-of-the-art solutions for keyword extraction and semantic textual similarity. In a nutshell, the pipeline computes the semantic similarity between sentence embeddings generated from topic-specific keywords and those produced from news article keywords extracted with the KeyBERT algorithm, finally classifying each article under a previously defined topic. The results are supported by sound coherence and diversity metrics, computed by aggregating articles by their first tag, which attest to the semantic validity of the pipeline outputs.
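The core tagging step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the paper the vectors come from a BERT sentence encoder applied to KeyBERT-extracted article keywords and to the user-defined topic dictionary, whereas here `topics` and `article` hold hypothetical placeholder embeddings so the cosine-similarity ranking step can stand alone.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tag_article(article_vec, topic_vecs):
    """Rank topics by semantic similarity to the article embedding.

    The first element of the returned list plays the role of the
    article's "first tag" used for the coherence/diversity evaluation.
    """
    scores = {t: cosine_similarity(article_vec, v) for t, v in topic_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings; in the pipeline these would be produced by a
# BERT-based sentence encoder from keyword lists, not hard-coded.
topics = {
    "economy": np.array([0.9, 0.1]),
    "sports": np.array([0.1, 0.9]),
}
article = np.array([0.8, 0.2])

print(tag_article(article, topics)[0])  # first tag: "economy"
```

Ranking all topics, rather than returning only the best match, mirrors the paper's use of the first tag per article while leaving the full similarity ordering available.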
