Abstract

Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examine their effects on system performance. Our results show a difference of 2.31 BLEU points, averaged over all test sets, between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large-data scenario. We show that a simple segmentation scheme can perform as well as the best, more complicated segmentation scheme. An in-depth analysis of the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models, and that aggressive segmentation can significantly increase the size of the phrase table and the uncertainty in choosing candidate translation phrases during decoding. An investigation of the output of the different systems reveals the complementary nature of their output and the great potential in combining them.

Highlights

  • Morphologically rich languages pose a challenge for statistical machine translation (SMT), as these languages possess a large set of morphological features producing a large number of surface forms

  • In this work we investigated the impact of Arabic morphological segmentation on the performance of a broad-coverage English-to-Arabic SMT system

  • We explored the full spectrum of Arabic segmentation schemes ranging from full word form to fully segmented forms and examined the effects on system performance


Summary

Introduction

Morphologically rich languages pose a challenge for statistical machine translation (SMT), as these languages possess a large set of morphological features producing a large number of surface forms. This increase in surface forms leads to larger vocabularies and higher sparsity, adversely affecting the performance of SMT systems. The PBSMT systems we build are evaluated and compared on English-to-Arabic test sets that we construct from existing NIST09 Arabic-to-English test sets. Based on this comparison, we identify the best and the worst segmentation schemes and lay out a set of general observations on the effect of splitting off different sets of clitics (affixes) on the performance of a broad-coverage PBSMT system. Related prior work compares a segmented English-to-Arabic system with an unsegmented system and experiments with a number of detokenization techniques.
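The trade-off between vocabulary size and text fragmentation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the Buckwalter-style transliterated forms, the `+` clitic markers, and the helper names `surface_form` and `vocab_stats` are hypothetical and do not come from the paper's data or tooling. Each word is given as a list of morphemes, with proclitics written `x+` and enclitics written `+x`.

```python
from collections import Counter

# Hypothetical toy corpus: one word per entry, already analysed into morphemes.
# Proclitics end with '+', enclitics start with '+'. Illustrative forms only.
WORDS = [
    ["ktAb"],
    ["w+", "ktAb"],
    ["b+", "ktAb"],
    ["w+", "b+", "ktAb"],
    ["ktAb", "+h"],
    ["w+", "ktAb", "+h"],
    ["qlm"],
    ["w+", "qlm"],
    ["b+", "qlm"],
    ["qlm", "+h"],
]


def surface_form(morphemes):
    """Join morphemes back into a single unsegmented word form."""
    return "".join(m.strip("+") for m in morphemes)


def vocab_stats(tokens):
    """Return (number of distinct types, number of tokens)."""
    counts = Counter(tokens)
    return len(counts), sum(counts.values())


# Unsegmented scheme: every full word form is one token.
unsegmented = [surface_form(word) for word in WORDS]

# Fully segmented scheme: every clitic becomes its own token.
segmented = [morpheme for word in WORDS for morpheme in word]

print("unsegmented (types, tokens):", vocab_stats(unsegmented))
print("segmented   (types, tokens):", vocab_stats(segmented))
```

On this toy input, splitting off clitics halves the number of distinct types while doubling the number of tokens. That is the tension the segmentation schemes navigate: less sparsity on one side, longer and more fragmented token sequences on the other. The numbers say nothing about the magnitudes observed on real data.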

Arabic Morphology and Orthography
Arabic words have fourteen morphological features
Arabic Preprocessing Schemes
Challenges of Arabic Detokenization
Detokenization Techniques
Training and Testing Data
Training Data
Tuning and Testing Sets
Results
Conclusions and Future Work
