Abstract

Parallel corpora are vital components in many Natural Language Processing (NLP) applications, particularly machine translation. In this paper, we present a novel method for automatically extracting parallel sentences from comparable corpora. The method requires a bilingual dictionary and an adequate word-vectorisation method. We apply the proposed method to Arabic and English Wikipedia as a comparable corpus and construct an Arabic-English parallel corpus. The resulting corpus consists of 105,010 parallel sentences totalling 4.6M words. During our study, we compared two word-vectorisation methods, word embeddings and term frequency-inverse document frequency (TF-IDF), in terms of their usefulness for computing similarities between well-formed and syntactically ill-formed sentences. We also examined the parallel corpus produced by our method quantitatively and qualitatively, comparing it with other available Arabic-English parallel corpora: GlobalVoices, TED, and Wiki-OPUS. We explored the main advantages and shortcomings of these corpora when used for NLP applications such as word semantic similarity identification and Neural Machine Translation (NMT). Word semantic similarity models trained on our parallel corpus outperformed models trained on the other corpora in the task of English non-similar word identification. Our parallel corpus also proved competitive for building Arabic-English NMT systems, yielding results comparable to those of the automatically created Wiki-OPUS corpus and the manually created TED corpus, and superior to those of the smaller GlobalVoices corpus.
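To make the general idea concrete, the minimal sketch below illustrates one way such a pipeline can work, under stated assumptions: each Arabic sentence is glossed into English word by word using a bilingual dictionary, and the (typically ill-formed) gloss is then scored against English candidate sentences with both TF-IDF and averaged word embeddings. The dictionary entries, the random placeholder embeddings, the example sentences, and the function names are all hypothetical stand-ins, not the paper's actual resources or implementation.

```python
# Illustrative sketch (not the authors' exact pipeline): mine candidate
# parallel pairs by (1) glossing an Arabic sentence into English with a
# bilingual dictionary, then (2) scoring the gloss against English
# candidates with two vectorisations, TF-IDF and averaged embeddings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical bilingual dictionary (Arabic -> English); a real run would
# load a full dictionary resource instead of this three-entry stand-in.
AR_EN_DICT = {"هذا": "this", "كتاب": "book", "جديد": "new"}

def gloss(ar_sentence: str) -> str:
    """Word-by-word dictionary translation; the result is usually
    syntactically ill-formed, hence the need for robust similarity."""
    return " ".join(AR_EN_DICT.get(tok, tok) for tok in ar_sentence.split())

def tfidf_scores(gloss_text: str, candidates: list[str]) -> np.ndarray:
    """Cosine similarity between TF-IDF vectors of gloss and candidates."""
    vec = TfidfVectorizer().fit([gloss_text] + candidates)
    return cosine_similarity(vec.transform([gloss_text]),
                             vec.transform(candidates))[0]

# Placeholder embeddings: random vectors keep the sketch self-contained;
# a real system would load pretrained word vectors (e.g., fastText).
rng = np.random.default_rng(0)
VOCAB = "this is a new book the sky was clear".split()
EMB = {w: rng.standard_normal(50) for w in VOCAB}

def embed(sentence: str) -> np.ndarray:
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [EMB[w] for w in sentence.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def embedding_scores(gloss_text: str, candidates: list[str]) -> np.ndarray:
    g = embed(gloss_text)
    return np.array([cosine_similarity([g], [embed(c)])[0, 0]
                     for c in candidates])

ar_sentence = "هذا كتاب جديد"
candidates = ["this is a new book", "the sky was clear"]
g = gloss(ar_sentence)  # -> "this book new": scrambled but matchable
for name, scores in (("tf-idf", tfidf_scores(g, candidates)),
                     ("embedding", embedding_scores(g, candidates))):
    best = int(np.argmax(scores))
    # A real pipeline would keep pairs only above a tuned threshold.
    print(f"{name}: best match = {candidates[best]!r}, "
          f"score = {scores[best]:.3f}")
```

The comparison between the two vectorisations matters precisely because the gloss is ill-formed: averaged embeddings tolerate word-order scrambling and near-synonym substitutions, while TF-IDF rewards exact lexical overlap.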
