Abstract

Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering to retain only data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post-editing and the application of various decoding strategies during translation. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties increase when PFA is used to process low-resource languages, because PFA requires large amounts of data and the data available for low-resource languages are often insufficient. Building on the premise that NMT model performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean–English NMT, a low-resource language pair. Through comparative experiments, we demonstrated that translation performance can be enhanced without changes to the model. We experimentally examined how performance changed in response to beam size and n-gram blocking, and whether performance was enhanced when a length penalty was applied. The results showed that these decoding strategies improve performance and compare well with previous Korean–English NMT approaches. Therefore, the proposed methodology can improve the performance of NMT models without the use of PFA, presenting a new perspective for improving machine translation performance.
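
To make these decoding strategies concrete, the minimal sketch below shows how beam size, n-gram blocking, and the length penalty are typically exposed in an off-the-shelf NMT toolkit, here the Hugging Face transformers generate() API. The checkpoint name and parameter values are illustrative assumptions, not the configuration used in this study.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Assumed off-the-shelf Korean->English checkpoint, chosen for illustration only.
    model_name = "Helsinki-NLP/opus-mt-ko-en"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # "Machine translation is a representative application of NLP."
    src = "기계 번역은 자연어 처리의 대표적인 응용 분야이다."
    inputs = tokenizer(src, return_tensors="pt")

    # Beam search with the three decoding knobs compared in this work:
    #   num_beams            - beam size (width of the search)
    #   no_repeat_ngram_size - n-gram blocking: forbids repeating any n-gram of this size
    #   length_penalty       - exponent that rescales the scores of finished hypotheses
    outputs = model.generate(
        **inputs,
        num_beams=5,
        no_repeat_ngram_size=3,
        length_penalty=1.0,
        max_new_tokens=128,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In beam search implementations such as GNMT, the length penalty typically normalizes a finished hypothesis score by ((5 + |Y|) / 6)^α, where |Y| is the output length; larger α favors longer translations, and α = 0 disables the penalty.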

Highlights

  • Natural language processing (NLP) is a subfield of artificial intelligence in which computers analyze human languages

  • NLP is divided into three main categories: rules-based, statistics-based, and deep learning-based

  • This study presents a methodology that can improve performance without using the Pretrain-Finetuning Approach (PFA), offering a new perspective for improving machine translation (MT) performance


Introduction

Natural language processing (NLP) is a subfield of artificial intelligence in which computers analyze human languages. NLP is divided into three main categories: rules-based, statistics-based, and deep learning-based. In rules-based and statistics-based NLP application software, system performance depends on the performance of various subcomponents, such as the part-of-speech tagger, syntactic parser, and semantic analyzer. Deep learning-based NLP application software operates in an end-to-end manner, and the performance of a model is independent of such subcomponents; the processes required for each step are handled simultaneously during training. Deep learning-based NLP application software has exhibited innovative performance in various NLP fields such as machine translation, speech recognition, and text
