A Compression-Based Multiple Subword Segmentation for Neural Machine Translation

Keita Nonaka, Tsuyoshi Okita, Hiroshi Sakamoto, Kazutaka Shimada, Kazutaka Yamanouchi, Tomohiro I

doi:10.3390/electronics11071014

Open Access

https://doi.org/10.3390/electronics11071014

Journal: Electronics
Publication Date: Mar 24, 2022
Citations: 8
License type: CC BY 4.0

Affiliation: Kyushu Institute of Technology

Abstract

In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among these methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches; however, compression-based approaches have the drawback that generating multiple segmentations is difficult because the underlying algorithms are deterministic. To overcome this difficulty, we focus on a stochastic string algorithm, called locally consistent parsing (LCP), which has been applied to achieve optimum compression. Employing the stochastic parsing mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and we show that it outperforms various baselines, especially when learning from small training data.
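To make the determinism issue concrete, the following is a minimal sketch of BPE-style segmentation with an optional dropout probability, in the spirit of BPE-dropout as it is commonly described. It is not the authors' LCP-dropout and not their implementation. The function name `bpe_segment`, the toy merge table, and the stopping rule (stop when no merge survives dropout) are assumptions made for illustration only.

```python
import random


def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` using an ordered merge table (hypothetical helper).

    merges:  list of (left, right) pairs, highest priority first.
    dropout: probability of skipping each applicable merge at each step;
             0.0 reproduces plain, deterministic BPE.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break  # assumption: stop once no merge is available this step
        _, i = min(candidates)  # best-ranked merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Toy merge table, assumed for illustration (not learned from real data).
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_segment("lower", merges))               # ['low', 'er'] every time
print(bpe_segment("lower", merges, dropout=1.0))  # ['l', 'o', 'w', 'e', 'r']
print(bpe_segment("lower", merges, dropout=0.5, rng=random.Random(0)))
```

With `dropout=0.0` the same word always yields the same segmentation, which is the determinism the abstract refers to. A nonzero `dropout` lets repeated calls produce different segmentations of the same word, which is what a multiple-segmentation method needs.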

Similar Papers

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo
01 Jan 2018

Finding Better Subwords for Tibetan Neural Machine Translation
Yachao Li ... Jia Yangji
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 20
15 Mar 2021

SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation
Haiyue Song ... Eiichiro Sumita
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 22
24 Aug 2023

Neural Machine Translation for Low-resource English-Bangla
Mohammad Abdullah Al Mumin ... Muhammed Zafar Iqbal
Journal of Computer Science | VOL. 15
01 Nov 2019
