Index Term Selection Heuristics for Arabic Text Retrieval

Yaser A Al-Lahham

doi:10.1007/s13369-020-05022-3

Abstract

Index term selection for Arabic text is challenging because of the complex morphology of the Arabic language, and it is a significant factor in the efficiency of any information retrieval system. Many index term selection methods have been proposed in the literature. Most of them are based on root extraction and stemming; others apply complex linguistic rules and machine learning tools. This paper proposes a simple index term selection method that uses heuristics to select a representative subset of terms to form the index. The heuristics take Arabic words carrying the prefix ‘AL’ (definite words) as the basis for selection. In addition, the method selects further words according to any of the following heuristics: words preceding or succeeding definite terms, words that follow certain linking words, words that follow prepositions in semi-sentences, and words that represent named entities. The proposed heuristics were tested on the TREC-2001/2002 Arabic test collection. The results show that the method is effective: it outperforms indexing all terms stemmed by either of two well-known stemmers. For example, selecting definite words and words that represent named entities improves mean average precision by 8.4% over indexing all terms stemmed by the LIGHT10 stemmer, while decreasing the index size by 27.8%.
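The core of the selection idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that the input is already tokenized, that a word counts as definite when its surface form starts with the letters alef and lam (a crude test that ignores attached proclitics and words whose stem happens to begin with those letters), and it covers only the definite-word basis and the preceding/succeeding heuristics. The function names `is_definite` and `select_index_terms` are hypothetical; the linking-word, preposition, and named-entity heuristics are omitted.

```python
# Illustrative sketch only: a surface-level approximation of the
# definite-word heuristic, not the method evaluated in the paper.

DEFINITE_PREFIX = "\u0627\u0644"  # Arabic letters alef + lam ("AL")


def is_definite(token: str) -> bool:
    """Assumed test: the token starts with 'AL' and has more letters after it."""
    return len(token) > len(DEFINITE_PREFIX) and token.startswith(DEFINITE_PREFIX)


def select_index_terms(tokens, include_preceding=False, include_succeeding=False):
    """Return definite tokens, optionally with their immediate neighbours.

    Each selected term is returned once, in order of first selection.
    """
    selected = []
    seen = set()

    def add(term):
        if term not in seen:
            seen.add(term)
            selected.append(term)

    for i, token in enumerate(tokens):
        if not is_definite(token):
            continue
        if include_preceding and i > 0:
            add(tokens[i - 1])  # word preceding the definite term
        add(token)
        if include_succeeding and i + 1 < len(tokens):
            add(tokens[i + 1])  # word succeeding the definite term
    return selected
```

Calling `select_index_terms(tokens)` keeps only the words that pass the prefix test, while the two keyword flags widen the selection to neighbouring words. A fuller treatment would need morphological analysis to decide definiteness reliably.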

Full Text