Towards Media Monitoring: Detecting Known and Emerging Topics through Multilingual and Crosslingual Text Classification

Jurgita Kapočiūtė-Dzikienė, Arūnas Ungulaitis

doi:10.3390/app14104320

Open Access

https://doi.org/10.3390/app14104320

Journal: Applied Sciences
Publication Date: May 20, 2024
License type: CC BY 4.0

Abstract

This study addresses challenges in media monitoring by enhancing closed-set topic classification in multilingual contexts (where both training and testing occur in several languages) and crosslingual contexts (where training is in English and testing spans all languages). To this end, we used a dataset from the European Media Monitoring webpage, which includes approximately 15,000 article titles across 18 topics in 58 different languages spanning a period of nine months from May 2022 to March 2023. We conducted a comprehensive comparative analysis of nine approaches, covering a spectrum of embedding techniques (word, sentence, and contextual representations) and classifiers (trainable/fine-tunable, memory-based, and generative). Our findings reveal that the LaBSE+FFNN approach achieved the best performance, reaching macro-averaged F1-scores of 0.944 ± 0.015 and 0.946 ± 0.019 in the multilingual and crosslingual scenarios, respectively. Because LaBSE+FFNN performs similarly in both scenarios, machine translation into English is not needed. We also tackled the open-set topic classification problem by training a binary classifier that distinguishes between known and new topics with an average loss of ∼0.0017 ± 0.0002. Various feature types were investigated, reaffirming the robustness of LaBSE vectorization. The experiments demonstrate that new topics can be identified with an accuracy that varies by topic but stays above ∼0.796 and reaches ∼0.9 on average. Both the closed-set and open-set topic classification modules, along with additional mechanisms for clustering new topics to organize and label them, are integrated into our media monitoring system, which is now used by a real client.
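The macro-averaged F1-score reported above gives every topic equal weight regardless of how many titles it contains, which matters when topics are unevenly represented. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `macro_f1` and the topic labels in the usage example are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 over all classes
    that occur in either the gold labels or the predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Per-class counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class receives a false positive
            fn[gold] += 1  # gold class receives a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical topic labels, for illustration only.
gold = ["politics", "politics", "sports", "health"]
pred = ["politics", "sports", "sports", "health"]
print(round(macro_f1(gold, pred), 4))
```

In this toy example the per-topic F1-scores are 2/3 for "politics", 2/3 for "sports", and 1 for "health", so the macro average is 7/9. A "± x" value such as those in the abstract would typically come from repeating the evaluation over several runs or splits and reporting the spread, although the exact procedure is not stated here.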

Full Text