Arabic–Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation

Fares Aqlan, Akram Al-Mansoub, Xiaoping Fan, Abdullah Alqwbani

doi:10.1109/access.2019.2941161

Open Access

https://doi.org/10.1109/access.2019.2941161

Abstract

Morphologically rich and complex languages such as Arabic pose a major challenge to neural machine translation (NMT) because they contain a large number of rare words that NMT cannot translate. Because NMT typically operates with a fixed vocabulary size, out-of-vocabulary words are represented by unknown word (UNK) symbols. Algorithms such as byte pair encoding (BPE) tackle the UNK problem by encoding these rare words as sequences of subword units. However, for highly inflected languages with many morphological variations, such as Arabic, this method has limitations that reduce translation quality. To alleviate the UNK problem and address the undesirable behavior of BPE when translating Arabic, we propose to utilize a romanization system that converts Arabic script into subword units. We investigate the effect of our approach on NMT performance under various segmentation scenarios and compare the results with systems trained on the original Arabic form. In addition, we integrate Romanized Arabic as an input factor for Arabic-sourced NMT and compare it with well-known factors, namely lemma, part-of-speech tags, and morphological features. Extensive experiments on Arabic-Chinese translation demonstrate that the proposed approaches effectively tackle the UNK problem and significantly improve translation quality for Arabic-sourced translation. Additional experiments in this study focus on developing an NMT system for Chinese-Arabic translation. Before running our experiments, we first propose standard criteria for filtering a parallel corpus, which help remove its noise.
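To make the BPE step mentioned in the abstract concrete, the following is a minimal sketch of how merge operations can be learned from word frequencies and then applied to an unseen word. It is a toy illustration of the general BPE idea, not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, the tie-breaking rule, and the English toy corpus are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself (an assumption
        # made here only to keep the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical); a real system would count words in the training data.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_freqs, num_merges=10)
print(apply_bpe("lowest", toy_merges))
```

Because an unseen word is always decomposable into known characters and merged units, it never has to be mapped to UNK. The segmentation is purely frequency-driven, however, which is why the resulting units need not correspond to meaningful morphemes.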

Highlights

Neural machine translation (NMT) has obtained impressive results in recent years [1] by outperforming traditional phrase-based statistical machine translation (PBSMT) approaches on various language pairs [2].
The corpus contains corrupted parts that negatively affect the quality of the systems and models that learn from it; data selection and filtering on this corpus improve MT performance in terms of training time and translation quality [34].
We propose a new approach as a subword transformation solution for Arabic-sourced NMT: we use morphological segmentation schemes to segment Arabic words and then employ a romanization system to convert the output into subword units.
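The segment-then-romanize idea in the last highlight can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' pipeline: the text here does not name the romanization system, so the sketch assumes a partial Buckwalter-style character table; the `+` clitic markers and the pre-segmented input are a hypothetical stand-in for the output of a morphological segmenter.

```python
# Partial Buckwalter-style table (assumed for illustration; diacritics and
# several hamza forms are omitted to keep the sketch short).
BUCKWALTER = {
    "ا": "A", "ب": "b", "ت": "t", "ث": "v", "ج": "j", "ح": "H", "خ": "x",
    "د": "d", "ذ": "*", "ر": "r", "ز": "z", "س": "s", "ش": "$", "ص": "S",
    "ض": "D", "ط": "T", "ظ": "Z", "ع": "E", "غ": "g", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
    "ى": "Y", "ة": "p", "ء": "'",
}


def romanize(token):
    """Map each Arabic character to its Latin symbol; leave other characters as-is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in token)


def romanize_segments(segments):
    """Romanize a list of morphological segments, preserving '+' clitic markers."""
    return [romanize(segment) for segment in segments]


# Hypothetical segmenter output for one word: two proclitics, a stem, one enclitic.
segments = ["و+", "ب+", "كتاب", "+هم"]
print(romanize_segments(segments))  # ['w+', 'b+', 'ktAb', '+hm']
```

Since the mapping is one character to one character, the romanized units can be converted back to Arabic script after processing, provided the table covers every character that occurs in the data.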

Summary

INTRODUCTION

Neural machine translation (NMT) has obtained impressive results in recent years [1] by outperforming traditional phrase-based statistical machine translation (PBSMT) approaches on various language pairs [2]. Segmenters split words into morphemes to reduce data sparseness, enhance word alignment, and improve translation quality. Even with this technique, rare and unknown words still occur in NMT. BPE may split a rare or unknown word into subword units that are either not meaningful or semantically different known units, which can produce semantically incorrect translations [8]. These cases arise when translating Arabic because it has a rich and complex system of inflectional and cliticization morphology. To exploit the power of our approach, we utilize Romanized Arabic as an input feature of an Arabic–Chinese factored NMT system to provide additional information on Arabic words. In this manner, we can further improve translation performance.
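As an illustration of how Romanized Arabic could be supplied as an input feature, the sketch below builds a factored source line in which each word is annotated with parallel factor streams. The `word|factor|factor` layout, the `to_factored` helper, and the example tags are assumptions for this sketch; the summary above does not specify the input format the authors used.

```python
def to_factored(words, *factor_streams, sep="|"):
    """Join each word with its factors, e.g. word|romanized|pos, one token per word."""
    for stream in factor_streams:
        if len(stream) != len(words):
            raise ValueError("every factor stream must have one entry per word")
    tokens = []
    for parts in zip(words, *factor_streams):
        # The separator must not occur inside a word or factor, or the line
        # could not be split back unambiguously.
        if any(sep in part for part in parts):
            raise ValueError(f"separator {sep!r} occurs inside {parts!r}")
        tokens.append(sep.join(parts))
    return " ".join(tokens)


# Hypothetical two-word source sentence with romanized and part-of-speech factors.
words = ["كتاب", "جديد"]
romanized = ["ktAb", "jdyd"]
pos_tags = ["noun", "adj"]
print(to_factored(words, romanized, pos_tags))
```

In a factored NMT encoder, each factor typically gets its own embedding table, and the per-factor embeddings are combined (for example, concatenated) into the input vector for that position. Note that some romanization schemes use `|` as a letter symbol (Buckwalter uses it for alef with madda), so a different separator or an escaping step may be needed in practice.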

RELATED WORK

NEURAL MACHINE TRANSLATION

SEGMENTATIONS AND SUBWORD APPROACHES

PROPOSED APPROACH

LINGUISTIC INPUT FEATURES

PROPOSED FEATURE

EXPERIMENTS AND RESULTS

CONCLUSION

Journal: IEEE Access
Publication Date: Jan 1, 2019
Citations: 13
License type: CC BY 4.0
