Abstract

Morphologically rich and complex languages such as Arabic pose a major challenge to neural machine translation (NMT) because they contain a large number of rare words that NMT cannot translate. Because NMT typically operates with a fixed vocabulary size, out-of-vocabulary words are represented by unknown word (UNK) symbols. Algorithms such as byte pair encoding (BPE) tackle the UNK problem by encoding rare words as sequences of subword units. However, for highly inflected languages with rich morphological variation, such as Arabic, BPE has limitations that reduce its effectiveness for translation quality. To alleviate the UNK problem and address the undesirable behavior of BPE when translating Arabic, we propose a romanization system that converts Arabic script into subword units. We investigate the effect of our approach on NMT performance under various segmentation scenarios and compare the results with systems trained on the original Arabic form. In addition, we integrate Romanized Arabic as an input factor for Arabic-sourced NMT and compare it with well-known factors, namely lemma, part-of-speech tags, and morphological features. Extensive experiments on Arabic-Chinese translation demonstrate that the proposed approaches effectively tackle the UNK problem and significantly improve translation quality for Arabic-sourced translation. Additional experiments focus on developing an NMT system for Chinese-Arabic translation. Before running our experiments, we first propose standard criteria for filtering a parallel corpus, which help remove its noise.

Highlights

  • Neural machine translation (NMT) has obtained impressive results in recent years [1] by outperforming traditional phrase-based statistical machine translation (PBSMT) approaches on various language pairs [2]

  • The corpus exhibits corrupted parts that negatively affect the quality of the systems and models that learn from the corpus; data selection and filtering on this corpus improve MT performance in terms of training time and translation quality [34]

  • We propose a new approach as a subword transformation solution for Arabic-sourced NMT: we use morphological segmentation schemes to segment Arabic words and then employ a romanization system to convert the output into subword units


Summary

INTRODUCTION

Neural machine translation (NMT) has obtained impressive results in recent years [1] by outperforming traditional phrase-based statistical machine translation (PBSMT) approaches on various language pairs [2]. Segmenters split words into morphemes to reduce data sparseness, enhance word alignment, and improve translation quality. Even with this technique, rare and unknown words still occur in NMT. BPE may split a rare or unknown word into subword units that are either not meaningful or are semantically different known units, which can produce semantically incorrect translations [8]. These cases arise when translating Arabic because it has a rich and complex inflectional and cliticization morphology. To exploit the power of our approach, we use Romanized Arabic as an input feature of an Arabic–Chinese factored NMT system to provide additional information on Arabic words. In this manner, we can further improve the translation performance.
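
The segment-then-romanize idea described above can be illustrated with a minimal sketch. This summary does not say which romanization system or segmentation scheme the paper uses, so the sketch assumes a small Buckwalter-style character table and a hand-segmented example token; the names `BUCKWALTER_SUBSET`, `romanize`, and `romanize_segments` are hypothetical and are not taken from the paper.

```python
from typing import List

# Illustrative subset of a Buckwalter-style transliteration table.
# Assumption: the romanization scheme actually used in the paper is not
# specified in this summary; this table only demonstrates the idea.
BUCKWALTER_SUBSET = {
    "\u0627": "A",  # alef
    "\u0628": "b",  # beh
    "\u062A": "t",  # teh
    "\u0643": "k",  # kaf
    "\u0644": "l",  # lam
    "\u0648": "w",  # waw
}


def romanize(token: str) -> str:
    """Replace each mapped Arabic character; leave all other characters unchanged."""
    return "".join(BUCKWALTER_SUBSET.get(ch, ch) for ch in token)


def romanize_segments(segments: List[str]) -> List[str]:
    """Romanize the output of a morphological segmenter, one segment at a time."""
    return [romanize(seg) for seg in segments]


# Hypothetical segmenter output for a word meaning "and the book":
# conjunction + definite article + stem, with "+" marking a clitic boundary.
segments = ["\u0648+", "\u0627\u0644+", "\u0643\u062A\u0627\u0628"]
print(romanize_segments(segments))  # ['w+', 'Al+', 'ktAb']
```

In this sketch the clitic markers pass through untouched, so the romanized segments can be fed to an NMT system as subword units or attached to each source word as an additional input factor.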

RELATED WORK
NEURAL MACHINE TRANSLATION
SEGMENTATIONS AND SUBWORD APPROACHES
PROPOSED APPROACH
LINGUISTIC INPUT FEATURES
PROPOSED FEATURE
EXPERIMENTS AND RESULTS
CONCLUSION
