Abstract

This paper explores a phrase-based statistical machine translation (PBSMT) pipeline for the English-Arabic (En-Ar) language pair. The work surveys the most recent experiments conducted to enhance Arabic machine translation in the En-Ar direction. It also focuses on free datasets and linguistically motivated ideas that enhance phrase-based En-Ar statistical machine translation (SMT), as it aims to use only those to build a large-scale En-Ar SMT system. In addition, the paper highlights Arabic linguistic challenges in machine translation (MT) in general. This paper can be considered a guide for building an En-Ar PBSMT system. Furthermore, the presented pipeline can be generalized to any language pair.

Highlights

  • Developing an automatic Machine Translation (MT) system has, throughout its history, posed many challenges to researchers

  • Statistical machine translation (SMT) is an MT paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text, known as parallel corpora

  • Dataset availability was crucial for us to be able to carry out practical experiments. While many such datasets are available for purchase, lack of funding forced us to search for freely available datasets for En-Ar SMT, bearing in mind that an SMT system needs both a monolingual corpus and a parallel corpus to train the language model and the translation model, respectively


Summary

INTRODUCTION

Developing an automatic Machine Translation (MT) system has, throughout its history, posed many challenges to researchers. Dictionary lookup: find the translation of a single source word in a target-language dictionary. Transfer: transform the source-text parse tree into a new parse tree for the target language. Interlingua approach: this approach tries to find a universal language into which any language can be translated. This universal language aims to be independent of, and intermediate between, the source and target texts. The idea of this approach is to represent the semantic analysis of the source text in an abstract logical form. SMT is an MT paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text, known as parallel corpora. Superficial fluency can be deceiving. The benefits are overemphasized for European languages.
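
To make the roles of the two corpora concrete, the following is a minimal sketch of the classical noisy-channel view of SMT: for an English input e, pick the Arabic output a that maximizes P(e | a) · P(a), where the translation model P(e | a) is estimated from parallel data and the language model P(a) from monolingual data. This is an illustration only and not the pipeline described in the paper; practical phrase-based systems typically combine more feature functions in a log-linear model and search over segmentations and reorderings. The phrase pairs, the monolingual sentences, the rough Latin transliteration of Arabic, and all function names are assumptions made up for this example.

```python
from collections import Counter
import math

# Hypothetical phrase pairs (English, Arabic), as if extracted from a
# word-aligned parallel corpus. Arabic is in rough Latin transliteration.
phrase_pairs = [
    ("the book", "alkitab"), ("the book", "alkitab"),
    ("the book", "kitab"), ("a book", "kitab"),
    ("is new", "jadid"), ("is new", "jadid"),
    ("is new", "aljadid"), ("the new", "aljadid"),
]

# Hypothetical monolingual Arabic corpus used to train the language model.
monolingual = [
    "alkitab jadid",
    "alkitab jadid jiddan",
    "kitab jadid",
    "alkitab aljadid huna",
]


def translation_model(pairs):
    """Relative-frequency estimate of P(english phrase | arabic phrase)."""
    pair_counts = Counter(pairs)
    arabic_counts = Counter(ar for _, ar in pairs)
    return {(en, ar): n / arabic_counts[ar] for (en, ar), n in pair_counts.items()}


def bigram_lm(sentences):
    """Return a function giving the add-one-smoothed bigram log-probability."""
    history, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(history) | {"</s>"})

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (history[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

    return logprob


def score(segmentation, tm, lm):
    """Noisy-channel score: log P(english | arabic) + log P(arabic)."""
    arabic = " ".join(ar for _, ar in segmentation)
    return sum(math.log(tm[pair]) for pair in segmentation) + lm(arabic)


tm = translation_model(phrase_pairs)
lm = bigram_lm(monolingual)

# Three candidate translations of "the book is new", each a monotone
# sequence of (English phrase, Arabic phrase) pairs.
candidates = [
    [("the book", "alkitab"), ("is new", "jadid")],
    [("the book", "kitab"), ("is new", "jadid")],
    [("the book", "alkitab"), ("is new", "aljadid")],
]
best = max(candidates, key=lambda c: score(c, tm, lm))
best_target = " ".join(ar for _, ar in best)
print(best_target)
```

In this toy setting the candidates are enumerated by hand, so there is no decoder; the point is only that the parallel data shapes the translation scores while the monolingual data rewards fluent target-side word sequences.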

STATISTICAL MACHINE TRANSLATION SYSTEM PIPELINE
ARABIC CHALLENGES IN MT
Orthographic Challenges
AVAILABLE CORPORA
LANGUAGE INDEPENDENT TOOLS
TOOLS FOR LINGUISTICS ENHANCEMENTS
Orthographically Processing Techniques
Morphologically Processing Techniques
Syntactically Processing Techniques
Findings
CONCLUSIONS
