Abstract
In this paper we propose a hybrid approach to aligning single words, compound words and idiomatic expressions in English-Arabic parallel corpora. The objective is to automatically develop, improve and maintain translation lexicons. The approach combines linguistic and statistical information in order to improve word alignment results. The linguistic information consists of an existing bilingual lexicon, named entity recognition, grammatical tag matching and detection of syntactic dependency relations between words. The statistical information consists of the number of occurrences of repeated words, their positions in the parallel corpus and their lengths in characters. Single-word alignment uses an existing bilingual lexicon, named entity and cognate detection, and grammatical tag matching. Compound word alignment establishes correspondences between the compound words of the source sentence and those of the target sentence. A syntactic analysis is applied to the source and target sentences in order to extract dependency relations between words and to recognize compound words. Idiomatic expression alignment starts with monolingual term extraction for each of the source and target languages, which provides a list of sequences of repeated words and a list of potential translations. These sequences are represented as vectors that indicate their number of occurrences and the number of segments in which they appear. Translation relations between the source and target expressions are then evaluated with a distance metric. We evaluated the single-word and multiword expression aligners using two methods: a manual evaluation of the alignment quality on 1000 pairs of English-Arabic sentences, and an evaluation of the impact of this alignment on the translation quality of a machine translation system.
The results show that these aligners generate a translation lexicon with around 85% precision and yield a gain of 0.20 in BLEU score for the machine translation system.
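As an illustration of the idiomatic expression step, the following minimal sketch represents each candidate expression as a vector of per-segment occurrence counts and picks the target candidate closest to the source expression. The abstract does not name the distance metric, so cosine distance is an assumption here, as are the function names and the toy data (the `tgt_*` tokens are placeholders standing in for Arabic words).

```python
import math


def occurrence_vector(expression, segments):
    """Count occurrences of `expression` in each aligned segment.

    Uses plain substring counting, which is a simplification.
    """
    return [segment.count(expression) for segment in segments]


def cosine_distance(u, v):
    """Cosine distance between two count vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an expression that never occurs as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def best_translation(source_expr, candidates, source_segments, target_segments):
    """Return the target candidate whose occurrence vector is closest."""
    src_vec = occurrence_vector(source_expr, source_segments)
    scored = [
        (cosine_distance(src_vec, occurrence_vector(c, target_segments)), c)
        for c in candidates
    ]
    return min(scored)[1]


# Toy parallel corpus: segment i of the source aligns with segment i of the target.
source_segments = [
    "he kicked the bucket yesterday",
    "the weather is nice",
    "they kicked the bucket too",
]
target_segments = [
    "tgt_idiom tgt_yesterday",
    "tgt_weather tgt_nice",
    "tgt_they tgt_idiom tgt_too",
]

match = best_translation(
    "kicked the bucket",
    ["tgt_idiom", "tgt_weather"],
    source_segments,
    target_segments,
)
```

In this toy corpus the source idiom and `tgt_idiom` occur in the same segments, so their vectors coincide and `tgt_idiom` is selected. A real system would restrict the candidates to the repeated sequences produced by the monolingual extraction step.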