Stemming is an important pre-processing step in text analysis domains such as text mining, text summarization and information retrieval (IR). In this study, we build a Sanskrit text collection and explore different indexing, stemming and searching strategies for Sanskrit. We also propose two stemmers, a ‘light’ and an ‘aggressive’ one, and evaluate their effectiveness in the text analysis task. The performance of the stemmers is evaluated in two ways: a direct evaluation and an indirect, IR-based evaluation. In the direct evaluation, we found that the stemmers are effective. In the indirect evaluation, we apply different retrieval models such as BM25, TF–IDF, Divergence from Randomness (DFR) based models and language models. The proposed stemmers are compared with the GRAS stemmer, language-independent indexing (trunc-n) and a no-stemming approach. Among the different stemming methods, aggressive stemming provides the best performance. The Hiemstra language model outperforms the other retrieval models we experimented with. In the statistical analysis, we found that the proposed stemming approaches produce significantly better results than the no-stemming approach.