Abstract
This study aims to address challenges in media monitoring by enhancing closed-set topic classification in multilingual contexts (where both training and testing occur in several languages) and crosslingual contexts (where training is in English and testing spans all languages). To achieve this goal, we utilized a dataset from the European Media Monitoring webpage, which includes approximately 15,000 article titles across 18 topics in 58 different languages spanning a period of nine months from May 2022 to March 2023. Our research conducted comprehensive comparative analyses of nine approaches, encompassing a spectrum of embedding techniques (word, sentence, and contextual representations) and classifiers (trainable/fine-tunable, memory-based, and generative). Our findings reveal that the LaBSE+FFNN approach achieved the best performance, reaching macro-averaged F1-scores of 0.944 ± 0.015 and 0.946 ± 0.019 in both multilingual and crosslingual scenarios. LaBSE+FFNN’s similar performance in multilingual and crosslingual scenarios eliminates the need for machine translation into English. We also tackled the open-set topic classification problem by training a binary classifier capable of distinguishing between known and new topics with the average loss of ∼0.0017 ± 0.0002. Various feature types were investigated, reaffirming the robustness of LaBSE vectorization. The experiments demonstrate that, depending on the topic, new topics can be identified with accuracies above ∼0.796 and of ∼0.9 on average. Both closed-set and open-set topic classification modules, along with additional mechanisms for clustering new topics to organize and label them, are integrated into our media monitoring system, which is now used by our real client.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.