Improving Data Augmentation for Low-Resource NMT Guided by POS-Tagging and Paraphrase Embedding

Mieradilijiang Maimaiti, Maosong Sun, Huanbo Luan, Zegao Pan, Yang Liu

doi:10.1145/3464427

Abstract

Data augmentation is used in several text generation tasks. In machine translation, and especially in low-resource language scenarios, many data augmentation methods have been proposed. The most widely used approaches for generating pseudo data rely mainly on word omission, random sampling, or replacing some words in the text. However, previous methods can barely guarantee the quality of the augmented data. In this work, we build the augmented data using paraphrase embeddings and POS tagging. Specifically, we generate a pseudo monolingual corpus by replacing words that carry one of four main POS tags (noun, adjective, adverb, and verb), based on both a paraphrase table and word similarity. We select the larger word-level paraphrase table and obtain the word embedding of each word in the table, then calculate the cosine similarity between these words and the tagged words in the original sequence. In addition, we use a ranking algorithm to choose highly similar words, which reduces semantic errors, and we rely on POS-guided replacement to mitigate syntactic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous SOTA methods on low-resource language pairs, across seven language pairs from four corpora, by 1.16 to 2.39 BLEU points.
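The replacement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paraphrase table, the embedding vectors, the tag names, the `min_sim` threshold, and the function names (`cosine`, `best_paraphrase`, `augment`) are all hypothetical toy values chosen for illustration, and the ranking is simplified to picking the single most similar candidate.

```python
import math

# Hypothetical toy resources (assumptions, not data from the paper):
# a word-level paraphrase table and small hand-made embedding vectors.
PARAPHRASES = {
    "quick": ["fast", "hasty"],
    "dog": ["hound", "puppy"],
}
EMBEDDINGS = {
    "quick": [0.9, 0.1, 0.0],
    "fast": [0.8, 0.2, 0.0],
    "hasty": [0.3, 0.7, 0.1],
    "dog": [0.0, 0.9, 0.4],
    "hound": [0.1, 0.8, 0.5],
    "puppy": [0.0, 0.5, 0.9],
}
# The four POS classes named in the abstract (tag names are an assumption).
REPLACEABLE_TAGS = {"NOUN", "ADJ", "ADV", "VERB"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_paraphrase(word, min_sim):
    """Rank paraphrase candidates by similarity and return the top one, if any."""
    if word not in EMBEDDINGS:
        return None
    scored = [
        (cosine(EMBEDDINGS[word], EMBEDDINGS[cand]), cand)
        for cand in PARAPHRASES.get(word, [])
        if cand in EMBEDDINGS
    ]
    scored = [pair for pair in scored if pair[0] >= min_sim]
    return max(scored)[1] if scored else None


def augment(tagged_sentence, min_sim=0.9):
    """Replace only words with a replaceable POS tag; keep everything else."""
    output = []
    for word, tag in tagged_sentence:
        replacement = None
        if tag in REPLACEABLE_TAGS:
            replacement = best_paraphrase(word, min_sim)
        output.append(replacement or word)
    return output


sentence = [("the", "DET"), ("quick", "ADJ"), ("dog", "NOUN"), ("runs", "VERB")]
print(augment(sentence))
```

In this toy setup, words outside the four POS classes, and words with no sufficiently similar candidate in the table, are left unchanged, which mirrors the abstract's goal of limiting semantic and syntactic errors in the pseudo data.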


Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Aug 12, 2021
Citations: 4

