OOV words in an English-Arabic CLIR system

Abdelghani Bellaachia, Ghita Amortijani

doi:10.1109/iscc.2008.4625724

Abstract

Proper nouns are usually the key terms in a query, and translating them correctly can be necessary to maintain good retrieval performance in a cross-language information retrieval (CLIR) system. However, dictionaries include only the most commonly used proper nouns, such as major countries and capitals. Because proper nouns are spelling variants of each other in most languages, the common approach to finding the target-language counterpart of a query key is to apply an approximate string matching technique against the target database index. The n-gram technique has proved to be the most effective of these approximate string matching techniques. Since we are dealing with an English-Arabic CLIR system, which involves two languages with different alphabets, we combine transliteration with the n-gram technique to generate the different spelling variants of out-of-vocabulary (OOV) words. We call this technique Transliteration N-gram (TNG). One issue that arises with Arabic is that words with the same spelling can have different meanings depending on the context of the sentence. This is particularly true of proper names, which usually carry a meaning of their own when used as a verb or adjective. To further enhance our transliteration approach, we use part-of-speech (POS) disambiguation to remove unrelated words from the set of transliterations obtained using TNG.
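To make the n-gram matching step concrete, the following is a minimal sketch of approximate string matching over character bigrams. It is not the authors' TNG implementation: the Dice coefficient, the `#` boundary padding, the 0.5 threshold, and the function names are all assumptions chosen for illustration, and the transliteration candidates are hard-coded rather than generated by a transliteration model.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of `word`, padded with '#' at both ends."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two strings (assumed similarity measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_matches(candidates, index_terms, n=2, threshold=0.5):
    """Score each index term by its best similarity to any candidate transliteration.

    Returns (term, score) pairs at or above `threshold`, highest score first.
    """
    if not candidates:
        return []
    scored = []
    for term in index_terms:
        score = max(dice(candidate, term, n) for candidate in candidates)
        if score >= threshold:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical transliteration candidates for the English query key "Clinton".
candidates = ["كلينتون", "كلنتون"]
# Hypothetical terms from an Arabic document index.
index_terms = ["لندن", "كلينتن", "كلينتون"]

matches = best_matches(candidates, index_terms)
for term, score in matches:
    print(term, round(score, 2))
```

In this sketch each index term is compared against every candidate spelling and keeps its best score, so a spelling variant in the index can still be retrieved when only one of the candidate transliterations resembles it. A POS-based filter of the kind the abstract describes would then be applied to the returned terms.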

Full Text