Abstract
Morphologically rich and complex languages such as Arabic pose a major challenge to neural machine translation (NMT) because of their large number of rare words, which NMT cannot translate. Because NMT typically operates with a fixed-size vocabulary, out-of-vocabulary words are represented by an unknown-word (UNK) symbol. Algorithms such as byte pair encoding (BPE) tackle the UNK problem by encoding rare words as sequences of subword units. However, for languages with highly inflected and morphological variations, such as Arabic, this method has limitations that reduce translation quality. To alleviate the UNK problem and address the undesirable behavior of BPE when translating Arabic, we propose a romanization system that converts Arabic script to subword units. We investigate the effect of our approach on NMT performance under various segmentation scenarios and compare the results with systems trained on the original Arabic form. In addition, we integrate Romanized Arabic as an input factor for Arabic-sourced NMT and compare it with well-known factors, namely lemma, part-of-speech tags, and morphological features. Extensive experiments on Arabic-Chinese translation demonstrate that the proposed approaches effectively tackle the UNK problem and significantly improve translation quality for Arabic-sourced translation. Additional experiments focus on developing an NMT system for Chinese-Arabic translation. Before running our experiments, we first propose standard criteria for filtering a parallel corpus, which help remove its noise.
Highlights
Neural machine translation (NMT) has obtained impressive results in recent years [1] by outperforming traditional phrase-based statistical machine translation (PBSMT) approaches on various language pairs [2].
The corpus exhibits corrupted parts that negatively affect the quality of the systems and models that learn from it; data selection and filtering on this corpus improve MT performance in terms of training time and translation quality [34].
We propose a new subword transformation approach for Arabic-sourced NMT: we use morphological segmentation schemes to segment Arabic words and then employ a romanization system to convert the output into subword units.
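The segment-then-romanize idea in the last highlight can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the source does not name its romanization scheme, so the table below is a partial Buckwalter-style letter mapping, and the input tokens are a hypothetical, already segmented form of one word with `+` marking clitic boundaries.

```python
# Partial Buckwalter-style transliteration table.
# Assumption: used for illustration only; the source does not name its scheme.
BUCKWALTER = {
    "ا": "A", "ب": "b", "ت": "t", "ث": "v", "ج": "j", "ح": "H", "خ": "x",
    "د": "d", "ذ": "*", "ر": "r", "ز": "z", "س": "s", "ش": "$", "ص": "S",
    "ض": "D", "ط": "T", "ظ": "Z", "ع": "E", "غ": "g", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
    "ة": "p", "ى": "Y",
}


def romanize(token: str) -> str:
    """Map each Arabic letter to a Latin symbol; keep any other character as-is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in token)


# Hypothetical output of a morphological segmenter for one word:
# conjunction proclitic, stem, pronominal enclitic.
segmented_tokens = ["و+", "كتاب", "+هم"]

romanized_tokens = [romanize(tok) for tok in segmented_tokens]
print(" ".join(romanized_tokens))  # w+ ktAb +hm
```

The mapping is one character to one character, so the morpheme boundaries produced by the segmenter survive romanization unchanged.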
Summary
Neural machine translation (NMT) has obtained impressive results in recent years [1] by outperforming traditional phrase-based statistical machine translation (PBSMT) approaches on various language pairs [2]. Segmenters split words into morphemes to reduce data sparseness, enhance word alignment, and improve translation quality. Even with this technique, rare and unknown words still occur in NMT. BPE may split a rare or unknown word into either meaningless subword units or semantically different known units, which can produce semantically incorrect translations [8]. These cases appear when translating Arabic because it has a rich and complex system of inflectional and cliticization morphology. To exploit the power of our approach, we use Romanized Arabic as an input feature of an Arabic–Chinese factored NMT system to provide additional information on Arabic words. In this manner, we can further improve translation performance.
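To make the BPE behavior described above concrete, the following is a minimal sketch of learning and applying BPE merges. It is a generic toy implementation, not the authors' setup: the romanized word frequencies, the merge count, and the function names `learn_bpe`, `apply_bpe`, and `merge_pair` are all hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges, always picking the most frequent adjacent pair."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus of romanized word forms with made-up frequencies.
word_freqs = {"ktAb": 5, "ktb": 3, "kAtb": 2}
merges = learn_bpe(word_freqs, num_merges=4)

# An unseen word is never mapped to UNK: it falls back to smaller known units.
print(apply_bpe("mktbp", merges))
```

Because the merges are driven purely by frequency, the resulting units need not line up with morphemes, which is the limitation the summary attributes to BPE on morphologically rich languages.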