SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Haiyue Song, Sadao Kurohashi, Chenhui Chu, Raj Dabre, Eiichiro Sumita

doi:10.1145/3610611

Abstract

Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE); however, they are inefficient, as they require parallel corpora, days to train, and hours to decode. This article introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train and decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability, and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle-, and high-resource scenarios, where we compare the performance of different segmentation methods. The experimental results demonstrate that, on the low-resource ALT dataset, our method achieves, on average, an improvement of more than 1.2 BLEU over BPE and SentencePiece, and of 1.1 BLEU over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT). The regularization method achieves an improvement of approximately 4.3 BLEU over BPE and 1.2 BLEU over BPE-dropout, the regularized version of BPE. We also observe significant improvements on the IWSLT15 Vi→En, WMT16 Ro→En, and WMT15 Fi→En datasets and competitive results on the WMT14 De→En and WMT14 Fr→En datasets. Furthermore, compared to DPE, our method is 17.8× faster during training and up to 36.8× faster during decoding in a high-resource scenario. We provide extensive analysis, including why monolingual word-level data is enough to train SelfSeg.
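
The abstract states that the maximum-posterior segmentation is found with a dynamic programming algorithm but gives no further detail. As a rough illustration of that general idea only, the sketch below runs a Viterbi-style search over all ways to split a word, assuming each candidate piece has a fixed log-probability. The function name `best_segmentation`, the `max_len` parameter, and the toy score table are hypothetical; in SelfSeg the scores would presumably come from the neural model conditioned on the masked character sequence, which is not reproduced here.

```python
import math


def best_segmentation(word, log_prob, max_len=None):
    """Return (score, pieces) for the highest-scoring split of `word`.

    `log_prob` maps a candidate piece to its log-probability. This fixed
    table is an assumption for illustration, not the SelfSeg model.
    """
    n = len(word)
    max_len = max_len or n
    # best[i] holds the best total log-probability of segmenting word[:i].
    best = [-math.inf] * (n + 1)
    # back[i] holds the start index of the last piece in that best split.
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece not in log_prob or best[start] == -math.inf:
                continue
            score = best[start] + log_prob[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        # No sequence of known pieces covers the whole word.
        return -math.inf, []
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return best[n], pieces[::-1]


# Hypothetical piece probabilities, chosen only to make the example concrete.
toy_log_prob = {
    "un": math.log(0.2),
    "lock": math.log(0.3),
    "ed": math.log(0.25),
    "locked": math.log(0.02),
    "u": math.log(0.05),
    "n": math.log(0.05),
}

score, pieces = best_segmentation("unlocked", toy_log_prob)
print(pieces, score)
```

The table `best` is filled left to right, so each prefix is scored once per candidate last piece, which keeps the search quadratic in word length (or linear when `max_len` is bounded) rather than exponential in the number of possible splits.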

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Aug 24, 2023
Citations: 2

