Similarity-based model for transliteration

Mohamed Abdel Fattah, Fuji Ren

doi:10.1007/978-0-387-76483-2_17

Abstract

A significant proportion of out-of-vocabulary (OOV) words are named entities and technical terms. Typical analyses find around 50% of OOV words to be named entities, yet these can be the most important words in a query. For example, in the list of queries for the TREC 2001 cross-language track, all 25 queries contained proper names. Cross-language retrieval performance (average precision) dropped by more than 50% when named entities in the queries were not translated. One way to deal with OOV words when the two languages have different alphabets is to transliterate the unknown words, that is, to represent words of one language using the alphabet (orthography) of the other. In the present study, we present different approaches for extracting transliterated proper noun pairs from parallel corpora, based on different similarity measures between the English and the romanized Arabic proper nouns under consideration. The strength of our new system is that it works well for low-frequency proper noun pairs. We evaluate the new approaches using two different English–Arabic parallel corpora. Most of our results outperform previously published results in terms of precision, recall, and F-measure.

Keywords: Proper Noun, Sentence Pair, Parallel Corpus, Short Vowel, Arabic Word
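To make the idea of similarity-based pair extraction concrete, the following is a minimal sketch and not the authors' method: the abstract does not specify the similarity measures used. It assumes a hypothetical normalization (`consonant_skeleton`, which lowercases a word and drops vowels, on the assumption that short vowels are often missing from the romanized Arabic side), a generic string similarity from Python's standard `difflib` module, and an arbitrary threshold of 0.8. The function names and the romanized example words are illustrative only.

```python
from difflib import SequenceMatcher

VOWELS = set("aeiou")


def consonant_skeleton(word: str) -> str:
    """Lowercase the word and keep only non-vowel letters.

    Assumption for illustration: vowels are unreliable across the two
    scripts, so comparing consonant skeletons is a crude normalization.
    """
    letters = [c for c in word.lower() if c.isalpha()]
    return "".join(c for c in letters if c not in VOWELS)


def similarity(english: str, romanized_arabic: str) -> float:
    """Return a similarity score in [0, 1] between two candidate words."""
    a = consonant_skeleton(english)
    b = consonant_skeleton(romanized_arabic)
    if not a or not b:
        return 0.0
    # SequenceMatcher.ratio() is a generic measure, swapped in here
    # as a stand-in for the paper's (unspecified) similarity measures.
    return SequenceMatcher(None, a, b).ratio()


def extract_pairs(english_words, arabic_words, threshold=0.8):
    """Pair each English word with its best-scoring romanized Arabic word,
    keeping the pair only if the score reaches the (assumed) threshold."""
    pairs = []
    for e in english_words:
        best = max(arabic_words, key=lambda r: similarity(e, r), default=None)
        if best is not None and similarity(e, best) >= threshold:
            pairs.append((e, best))
    return pairs
```

Because such a score depends only on the two strings in an aligned sentence pair, not on how often they co-occur, a candidate pair can be accepted even if it appears once in the corpus, which is the kind of low-frequency case the abstract highlights.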

Full Text