Stemming plays a crucial role in natural language processing and information retrieval. It is challenging for the Gujarati language because of its complex morphology. Several stemming algorithms for Gujarati have been developed using rule-based, dictionary-based, or hybrid approaches. However, they are computationally expensive, prone to over-stemming errors, and limited in accuracy. This paper introduces three novel optimized Gujarati stemmers that use a trie data structure to overcome these limitations. The significant contributions of this paper are as follows. First, three optimized Gujarati stemmers are proposed: Optimized Gujarati Stemmer using Suffix Stripping Approach (OGS_SSA), Optimized Gujarati Stemmer using Rule-Based Approach (OGS_RBA), and Optimized Gujarati Stemmer using Re-parsing Based Approach (OGS_RPA). Second, a novel algorithm to create a Gujarati dictionary using the trie data structure is proposed. Third, the proposed stemmers are rigorously assessed on three standard datasets covering entertainment, health, and agriculture. Their performance is measured using precision, recall, F1 score, accuracy, number of stemming errors, and processing time. The results show that OGS_RPA consistently outperforms OGS_SSA and OGS_RBA in precision, recall, F1 score, and accuracy, and that it produces fewer stemming errors. The proposed stemmers are also compared with the existing Gujarati hybrid stemmer. OGS_RPA shows a 14–16% improvement in accuracy and requires less processing time than the Gujarati hybrid stemmer. OGS_SSA offers fast processing, making it a feasible option for applications that prioritize prompt response; it also achieves a 10–11% improvement in accuracy and a reduction in processing time relative to the Gujarati hybrid stemmer. OGS_RBA, owing to its rule-based methodology, performs moderately compared with OGS_RPA and OGS_SSA, but it still shows a 10–13% improvement in accuracy over the Gujarati hybrid stemmer.
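To make the idea of a trie-backed dictionary combined with suffix stripping concrete, the following minimal Python sketch may help. It is not the algorithm proposed in the paper. The class names, the `stem` function, the tiny word list, and the suffix list are all hypothetical and chosen only for illustration. The sketch assumes that a candidate stem is accepted only when it is found in the dictionary, which is one simple way to limit over-stemming.

```python
from typing import Dict, Iterable


class TrieNode:
    """One node of the trie; children are keyed by a single character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


class Trie:
    """Dictionary of valid stems stored as a character trie."""

    def __init__(self, words: Iterable[str] = ()) -> None:
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word: str) -> bool:
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


# Illustrative (assumed) suffix list; a real stemmer would need a far
# larger, linguistically validated set.
SUFFIXES = ("માં", "થી", "ને", "નો", "ની", "નું")


def stem(word: str, dictionary: Trie, suffixes: Iterable[str] = SUFFIXES) -> str:
    """Strip a known suffix if the remainder is a dictionary word."""
    if word in dictionary:
        return word
    # Try longer suffixes first so the longest matching suffix wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in dictionary:
                return candidate
    # No validated stem found: leave the word unchanged.
    return word


if __name__ == "__main__":
    # Toy dictionary (assumed): "ઘર" (house), "શહેર" (city).
    gujarati_dictionary = Trie(["ઘર", "શહેર"])
    for token in ["ઘરમાં", "શહેરથી", "ઘર", "પાણી"]:
        print(token, "->", stem(token, gujarati_dictionary))
```

A trie lookup costs time proportional to the length of the word rather than the size of the dictionary, which is one plausible reason a trie suits this task. The dictionary check before accepting a stripped form is a design choice of this sketch, not a detail taken from the paper.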