The impact of Arabic morphological segmentation on broad-coverage English-to-Arabic statistical machine translation

Hassan Al-Haj, Alon Lavie

doi:10.1007/s10590-011-9101-1

Abstract

Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examine their effects on system performance. Our results show a difference of 2.31 BLEU points, averaged over all test sets, between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large-data scenario. We show that a simple segmentation scheme can perform as well as the best, more complicated segmentation scheme. An in-depth analysis of the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models, and that aggressive segmentation can significantly increase the size of the phrase table and the uncertainty in choosing candidate translation phrases during decoding. An investigation of the output of the different systems reveals the complementary nature of the outputs and the great potential in combining them.

Highlights

Morphologically rich languages pose a challenge for statistical machine translation (SMT), as these languages possess a large set of morphological features that produce a large number of rich surface forms.
In this work we investigated the impact of Arabic morphological segmentation on the performance of a broad-coverage English-to-Arabic SMT system.
We explored the full spectrum of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examined their effects on system performance.
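The trade-off between full word forms and segmented forms can be illustrated with a small sketch. It counts tokens and distinct types for a toy corpus before and after segmentation. The words are invented examples in Buckwalter-style transliteration, the `+` markers are an assumed convention for marking clitics, and `type_token_stats` is a hypothetical helper; none of this is taken from the paper's data or schemes.

```python
from collections import Counter

# Toy corpus (assumption): seven inflected forms built on the stem "ktAb".
unsegmented = [
    "wktAbhm", "wktAbhA", "bktAbhm", "bktAbhA",
    "ktAbhm", "ktAbhA", "wbktAbhA",
]

# The same corpus with clitics split off by hand; "w+" marks a proclitic,
# "+hm" marks an enclitic (assumed notation).
segmented = (
    "w+ ktAb +hm w+ ktAb +hA b+ ktAb +hm b+ ktAb +hA "
    "ktAb +hm ktAb +hA w+ b+ ktAb +hA"
).split()


def type_token_stats(tokens):
    """Return the number of running tokens and of distinct types."""
    counts = Counter(tokens)
    return {"tokens": len(tokens), "types": len(counts)}


print(type_token_stats(unsegmented))
print(type_token_stats(segmented))
```

In this toy case segmentation shrinks the vocabulary while nearly tripling the number of tokens, which gives some intuition for why more aggressive schemes reduce sparsity but fragment the text.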

Summary

Introduction

Morphologically rich languages pose a challenge for statistical machine translation (SMT), as these languages possess a large set of morphological features that produce a large number of rich surface forms. This increase in surface forms leads to larger vocabularies and higher sparsity, adversely affecting the performance of SMT systems. The PBSMT systems we build are evaluated and compared on English-to-Arabic test sets that we construct from existing NIST09 Arabic-to-English test sets. Based on this comparison, we identify the best and the worst segmentation schemes and lay out a set of general observations on the effect of splitting off different sets of clitics (affixes) on the performance of a broad-coverage PBSMT system. In related prior work, the authors compare a segmented English-to-Arabic system with an unsegmented system and experiment with a number of detokenization techniques.
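To make the ideas of splitting off clitics and detokenizing the output concrete, here is a minimal sketch in Python. The clitic lists, the `+` marker convention, the `min_stem` threshold, and the function names are all illustrative assumptions; this is plain string matching on Buckwalter-style transliteration, not the morphological analysis or the segmentation schemes used in the paper.

```python
# Toy clitic inventories (assumptions for illustration only).
PROCLITICS = ["w", "f", "b", "l", "Al"]
ENCLITICS = ["hA", "hm", "h", "k"]


def segment(word, split_proclitics=True, split_enclitics=True, min_stem=3):
    """Greedily split clitics off a word, keeping at least min_stem characters."""
    prefixes, suffixes, stem = [], [], word
    if split_proclitics:
        changed = True
        while changed:
            changed = False
            for p in PROCLITICS:
                if stem.startswith(p) and len(stem) - len(p) >= min_stem:
                    prefixes.append(p + "+")
                    stem = stem[len(p):]
                    changed = True
                    break
    if split_enclitics:
        for e in ENCLITICS:
            if stem.endswith(e) and len(stem) - len(e) >= min_stem:
                suffixes.append("+" + e)
                stem = stem[:-len(e)]
                break
    return prefixes + [stem] + suffixes


def detokenize(tokens):
    """Rejoin segmented tokens by simple concatenation on the '+' markers."""
    words, attach_next = [], False
    for tok in tokens:
        is_prefix = tok.endswith("+") and len(tok) > 1
        is_suffix = tok.startswith("+") and len(tok) > 1
        core = tok.strip("+") if (is_prefix or is_suffix) else tok
        if (is_suffix or attach_next) and words:
            words[-1] += core
        else:
            words.append(core)
        attach_next = is_prefix
    return " ".join(words)


full = segment("wbktAbhm")                              # split everything
light = segment("wbktAbhm", split_enclitics=False)      # proclitics only
print(full, light, detokenize(full + ["jdyd"]))
```

Toggling `split_proclitics` and `split_enclitics` mimics, very roughly, moving between lighter and more aggressive schemes. Simple concatenation only inverts segmentation when no spelling changes occur at clitic boundaries; real Arabic detokenization has to handle such orthographic adjustments, which this sketch deliberately ignores.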

Arabic Morphology and Orthography

Arabic words have fourteen morphological features.

Arabic Preprocessing Schemes

Challenges of Arabic Detokenization

Detokenization techniques

Training and Testing Data

Training Data

Tuning and testing sets

Results

Conclusions and Future Work


Journal: Machine Translation
Publication Date: Sep 22, 2011
Citations: 26
License type: cc-by

Similar Papers

Toward Building a Comprehensive Phrase-based English-Arabic Statistical Machine Translation System
Sara Ebrahim ... Mostafa Mostafa
The Egyptian Journal of Language Engineering | VOL. 4
15 Sep 2017

Hybrid data-driven models of machine translation
Declan Groves ... Andy Way
Machine Translation | VOL. 19
02 Nov 2006

Deep learning-based techniques to enhance the precision of phrase-based statistical machine translation system for Indian languages
K.P Soman ... J.P Sanjanasri
International Journal of Computer Aided Engineering and Technology | VOL. 13
01 Jan 2020

Statistical machine translation of Indian languages: a survey
Nadeem Khan Jadoon ... Waqas Anwar
Neural Computing and Applications | VOL. 31
17 Nov 2017
