Abstract

Multilingual characteristics, lack of annotated data, and imbalanced sample distribution are the three main challenges for toxic comment analysis in a multilingual setting. This paper proposes a multilingual toxic text classifier which adopts a novel fusion strategy that combines different loss functions and multiple pre-training models. Specifically, the proposed learning pipeline starts with a series of pre-processing steps, including translation, word segmentation, purification, text digitization, and vectorization, to convert word tokens to a vectorized form suitable for the downstream tasks. Two models, multilingual bidirectional encoder representation from transformers (MBERT) and XLM-RoBERTa (XLM-R), are employed for pre-training through Masking Language Modeling (MLM) and Translation Language Modeling (TLM), which incorporate semantic and contextual information into the models. We train six base models and fuse them to obtain three fusion models using the F1 scores as the weights. The models are evaluated on the Jigsaw Multilingual Toxic Comment dataset. Experimental results show that the best fusion model outperforms the two state-of-the-art models, MBERT and XLM-R, in F1 score by 5.05% and 0.76%, respectively, verifying the effectiveness and robustness of the proposed fusion strategy.
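
The abstract states that base models are fused using their F1 scores as weights, but this excerpt does not give the exact formula. The sketch below shows one plausible reading, assuming the fused score is a weighted average of each model's predicted toxicity probability with weights proportional to validation F1. The function name `f1_weighted_fusion` and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def f1_weighted_fusion(
    probs: Sequence[Sequence[float]], f1_scores: Sequence[float]
) -> List[float]:
    """Fuse per-model probabilities with weights proportional to each model's F1.

    probs[m][i] is model m's predicted toxicity probability for sample i.
    f1_scores[m] is model m's F1 on held-out data (assumed, not from the paper).
    """
    if not probs or len(probs) != len(f1_scores):
        raise ValueError("need one F1 score per model")
    total = sum(f1_scores)
    if total <= 0:
        raise ValueError("F1 scores must sum to a positive value")

    # Normalize F1 scores so that the weights sum to 1.
    weights = [f1 / total for f1 in f1_scores]

    n_samples = len(probs[0])
    if any(len(p) != n_samples for p in probs):
        raise ValueError("all models must score the same samples")

    # Weighted average of the model outputs, sample by sample.
    return [
        sum(w * model_probs[i] for w, model_probs in zip(weights, probs))
        for i in range(n_samples)
    ]


# Two hypothetical models scoring two comments.
fused = f1_weighted_fusion([[0.9, 0.2], [0.5, 0.4]], [0.6, 0.4])
```

Under this assumption a model with a higher F1 pulls the fused score further toward its own prediction, while a weaker model still contributes.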

Highlights

  • We find that, compared with binary cross-entropy (BCE), focal loss improves accuracy for both XLM-R and multilingual bidirectional encoder representation from transformers (MBERT)

  • Compared with the models XLM-R_FOCAL and MBERT_FOCAL, which use neither loss function fusion nor multi-model fusion, the proposed fusion model improves accuracy by 0.19% and 0.49%, respectively, and the average macro F1 value by 0.76% and 5.05%, respectively

  • We propose a multilingual toxic text detection method based on pretraining model fusion under imbalanced sample distribution
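
The contrast between BCE and focal loss in the highlights can be made concrete with a small sketch. It uses the standard binary focal loss, which scales the cross-entropy term by (1 - p_t)^gamma so that easy, well-classified samples contribute less, which is useful under imbalanced sample distribution. The defaults `gamma=2.0` and `alpha=0.25` are common choices from the focal loss literature and are an assumption here, not settings reported in this excerpt.

```python
import math


def bce_loss(p: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one sample with predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(
    p: float, y: int, gamma: float = 2.0, alpha: float = 0.25, eps: float = 1e-7
) -> float:
    """Binary focal loss for one sample (gamma and alpha defaults are assumptions)."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # The (1 - p_t) ** gamma factor down-weights samples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) versus a hard positive (p = 0.1).
easy_bce, hard_bce = bce_loss(0.9, 1), bce_loss(0.1, 1)
easy_focal, hard_focal = focal_loss(0.9, 1), focal_loss(0.1, 1)
```

With `gamma=0.0` and `alpha=0.5` the focal loss reduces to half the BCE value, so the two losses differ only in how strongly they emphasize hard samples.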

Summary

Introduction

A learning algorithm is adopted to fit the data in a training set, iteratively minimizing the prediction error until convergence. These feature-based learning models have demonstrated satisfactory performance in various text classification tasks. Deep learning models, on the other hand, can address this challenge by capturing semantic information directly from raw text data, without manual feature engineering, and thereby boost detection performance [21]. To this end, deep learning algorithms have recently appeared in numerous studies on text classification. We propose a learning pipeline based on model fusion for multilingual toxic text detection. […] of our model in Section 3; we give the experimental data and evaluation metrics, and compare and analyze the results of different detection models on the same dataset, in Section 4; Section 5 summarizes the work and proposes a future direction.

Information 2021, 12, 205. Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Monolingual Toxic Text Detection
Multilingual Toxic Text Detection
Conventional Learning Models
Deep Learning Models
Transfer Learning via Masked Language Models
Model Fusion
Multilingual Toxic Text Detection Model Based on Multi-Model Fusion
Text Pre-Processing
Translation
Word Segmentation
Text Purification
Sample Equilibrium
Lexicon Solidification
Word Embedding
Position Embedding
Pre-Training and Fine-Tuning Multilingual Models
The BERT Language Model
Pre-Training with Masking-Based Language Modeling
Pre-Training with Translation-Based Language Modeling
Fine-Tuning
A Fusion of Loss Functions
A Fusion of Multilingual Models
Dataset
Evaluation Metrics
Models
Benchmarks
Experimental Environment and Parameter Settings
Experimental Results and Analysis
Summary and Prospect
