Abstract
We explore and evaluate the effect of different stopword lists (non-corpus-based and corpus-based) in the information retrieval (IR) tasks with different Indian languages such as Bengali, Marathi, Gujarati, Hindi and English. The issue was investigated from three viewpoints. Is there any performance difference between non-corpus-based and corpus-based stopword removal in chosen Indian languages? Can corpus-based stopword lists improve performance in Indian languages IR? If yes, to what extent? Among the different corpus-based stopword lists, which stopword list provides the best IR performance? Does the length of a corpus-based stopword list affect the retrieval performance in Indian languages? If yes, to what extent? It was observed that a corpus-based stopword list provides better retrieval performance than a non-corpus-based stopword list in different Indian languages. Among the different corpus-based stopword lists generated and experimented with, Zipf’s law-based stopword list (idf-based one) provides the best retrieval performance in various Indian languages. The aggregation1-based stopword list provides better retrieval than the aggregation2-based list in Indian languages, but in English, the aggregation2-based stopword list performs better than the aggregation1-based list. The best performing idf-based stopword list improves MAP score by 5.43% in Bengali, 1.91% in Marathi, 5.4% in Gujarati, 1.5% in Hindi and 2.12% in English, respectively, over their baseline counterparts. The probabilistic retrieval models (BM25 and TF-IDF) perform best in different Indian languages. A smaller length of corpus-based stopword lists performs better than a larger length of non-corpus-based stopword lists for all the Indian languages considered. The proposed schemes demonstrate that a stopword list can be heuristically generated in a language-independent statistical method and effectively used for IR tasks with performance comparable, to or even better than non-corpus-based stopword lists.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: ACM Transactions on Asian and Low-Resource Language Information Processing
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.