A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution

Guizhe Song, Zhifeng Xiao, Degen Huang

doi:10.3390/info12050205

Open Access

https://doi.org/10.3390/info12050205

Abstract

Multilingual characteristics, a lack of annotated data, and imbalanced sample distribution are the three main challenges for toxic comment analysis in a multilingual setting. This paper proposes a multilingual toxic text classifier that adopts a novel fusion strategy combining different loss functions and multiple pre-trained models. Specifically, the proposed learning pipeline starts with a series of pre-processing steps, including translation, word segmentation, purification, text digitization, and vectorization, to convert word tokens to a vectorized form suitable for the downstream tasks. Two models, multilingual bidirectional encoder representations from transformers (MBERT) and XLM-RoBERTa (XLM-R), are employed for pre-training through Masked Language Modeling (MLM) and Translation Language Modeling (TLM), which incorporate semantic and contextual information into the models. We train six base models and fuse them to obtain three fusion models, using the F1 scores as the weights. The models are evaluated on the Jigsaw Multilingual Toxic Comment dataset. Experimental results show that the best fusion model outperforms the two state-of-the-art models, MBERT and XLM-R, in F1 score by 5.05% and 0.76%, respectively, verifying the effectiveness and robustness of the proposed fusion strategy.
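
The fusion step described in the abstract, where base models are combined using their F1 scores as weights, might look roughly like the sketch below. The abstract does not give the exact formula, so normalizing the F1 scores to sum to one and averaging per-sample toxicity probabilities are assumptions made for illustration. The function name `f1_weighted_fusion` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def f1_weighted_fusion(
    probs: Sequence[Sequence[float]],
    f1_scores: Sequence[float],
) -> List[float]:
    """Fuse per-model toxicity probabilities with F1-based weights.

    probs[m][i] is model m's predicted probability that sample i is toxic.
    f1_scores[m] is model m's validation F1 score.
    Assumption: weights are the F1 scores normalized to sum to 1.
    """
    if not probs or len(probs) != len(f1_scores):
        raise ValueError("need one F1 score per model and at least one model")
    n_samples = len(probs[0])
    if any(len(p) != n_samples for p in probs):
        raise ValueError("all models must score the same number of samples")
    total = sum(f1_scores)
    if total <= 0:
        raise ValueError("F1 scores must sum to a positive value")

    # Normalize so that better-scoring models contribute proportionally more.
    weights = [f / total for f in f1_scores]

    # Weighted average of the model probabilities, sample by sample.
    return [
        sum(w * p[i] for w, p in zip(weights, probs))
        for i in range(n_samples)
    ]


# Hypothetical example: two base models scoring two comments.
fused = f1_weighted_fusion(
    probs=[[0.9, 0.2], [0.7, 0.4]],
    f1_scores=[0.6, 0.4],
)
labels = [int(p >= 0.5) for p in fused]  # 0.5 threshold is also an assumption
```

Under imbalanced classes, the decision threshold applied to the fused probability is often tuned on validation data rather than fixed at 0.5. The abstract does not say which choice the authors made.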

Journal: Information
Publication Date: May 12, 2021
Citations: 10
License type: CC BY 4.0
