Stopword List Research Articles

We explore and evaluate the effect of different stopword lists (non-corpus-based and corpus-based) on information retrieval (IR) tasks in four Indian languages (Bengali, Marathi, Gujarati and Hindi) and in English. The issue was investigated from three viewpoints. First, is there any performance difference between non-corpus-based and corpus-based stopword removal in the chosen Indian languages, and can corpus-based stopword lists improve IR performance in these languages? If yes, to what extent? Second, among the different corpus-based stopword lists, which one provides the best IR performance? Third, does the length of a corpus-based stopword list affect retrieval performance in Indian languages? If yes, to what extent? It was observed that a corpus-based stopword list provides better retrieval performance than a non-corpus-based stopword list in the Indian languages studied. Among the corpus-based stopword lists generated and tested, the Zipf's law-based (idf-based) stopword list provides the best retrieval performance in the various Indian languages. The aggregation1-based stopword list provides better retrieval than the aggregation2-based list in the Indian languages, but in English the aggregation2-based list performs better than the aggregation1-based list. The best-performing idf-based stopword list improves the MAP score over the corresponding baseline by 5.43% in Bengali, 1.91% in Marathi, 5.4% in Gujarati, 1.5% in Hindi and 2.12% in English. The probabilistic retrieval models (BM25 and TF-IDF) perform best across the Indian languages. Shorter corpus-based stopword lists perform better than longer non-corpus-based stopword lists for all the Indian languages considered. The proposed schemes demonstrate that a stopword list can be heuristically generated by a language-independent statistical method and used effectively for IR tasks, with performance comparable to, or even better than, that of non-corpus-based stopword lists.
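As a rough illustration of the idf-based idea described in the abstract above, the following minimal sketch ranks terms by inverse document frequency and takes the lowest-scoring ones as stopword candidates. The abstract does not give the exact formula or cutoff, so the standard definition idf = log(N / df), the whitespace tokenizer, the function name `idf_stopwords`, the toy corpus and the list length `k` are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_stopwords(documents, k):
    """Return the k terms with the lowest inverse document frequency.

    Assumes idf = log(N / df); the study's exact scoring may differ.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each term at most once per document (document frequency).
        doc_freq.update(set(text.lower().split()))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    # Lowest idf first; ties are broken alphabetically for determinism.
    ranked = sorted(idf, key=lambda term: (idf[term], term))
    return ranked[:k]


# Hypothetical toy corpus; a real list would be built from the target collection.
corpus = [
    "the history of the river",
    "the sound of music",
    "the end of summer",
    "a tale of the city",
]
stopwords = idf_stopwords(corpus, k=2)
```

Because the procedure relies only on term statistics, it can be applied to any language for which the text can be tokenized, which is the sense in which such a method is language-independent. The list length `k` corresponds to the stopword-list length whose effect the study examines.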

Objective. This study aimed to identify the primary research areas, countries, and organizational involvement in publications on neurological disorders through an analysis of human-assigned keywords. These results were then compared with terms extracted from the titles and abstracts of the publications by unsupervised and machine-algorithm-based methods, to understand the deficiencies of both techniques. This enabled us to assess how far machine-derived terms from titles and abstracts can substitute for the human-assigned keywords of scientific research articles. Design/Methodology/Approach. Significant research areas on neurological disorders were identified from the author-provided keywords of publications downloaded from Web of Science and PubMed. These results were compared with terms extracted from titles and abstracts by an unsupervised tool, VOSviewer, and by machine-algorithm-based techniques, YAKE and CountVectorizer. Results/Discussion. We observed that the post-COVID-19 era witnessed more research on various neurological disorders, but authors still chose more generic terms than specific ones for their keyword lists. The unsupervised extraction tool, VOSviewer, identified many extraneous and insignificant terms along with significant ones. However, our self-developed machine learning algorithm using CountVectorizer and YAKE provided precise results, subject to adding more stop-words to the stop-word list of the NLTK toolkit. Conclusion. We observed that although author-provided keywords play a vital role, as they are assigned in a broader sense by the author to increase readability, these concept terms lacked the specificity needed for in-depth analysis. We suggested that the ML algorithm, being more compatible with unstructured data, was a valid alternative to author-generated keywords for more accurate results. Originality/Value. To our knowledge, this is the first study to compare author-provided keywords with machine-extracted terms on real datasets, which may be an essential lead in the machine learning domain. Replicating these techniques with large datasets from different fields may provide a valuable knowledge resource for experts and stakeholders.
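The dependence on an extended stop-word list noted in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: it uses only Python's standard library (`collections.Counter`) as a stand-in for count-based term extraction, and the word sets `BASE_STOPWORDS` and `DOMAIN_STOPWORDS`, the function `top_terms` and the sample titles are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for a general-purpose stop-word list.
BASE_STOPWORDS = {"a", "and", "in", "of", "on", "the", "with"}
# Hypothetical domain additions: frequent but uninformative in this field.
DOMAIN_STOPWORDS = {"patients", "results", "study"}


def top_terms(texts, stopwords, n):
    """Return the n most frequent non-stopword terms across the texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(tok for tok in tokens if tok not in stopwords)
    return [term for term, _ in counts.most_common(n)]


# Hypothetical titles, used only to show the effect of the extra stop-words.
titles = [
    "A study of epilepsy in patients with stroke",
    "Results of epilepsy study on patients",
    "Epilepsy and stroke: results in patients",
]
generic = top_terms(titles, BASE_STOPWORDS, 2)
specific = top_terms(titles, BASE_STOPWORDS | DOMAIN_STOPWORDS, 2)
```

With only the base list, a generic word such as "patients" ranks among the top terms; after the domain additions are merged in, the top terms are the disorder names themselves.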

Articles published on Stopword List

Stop-Word Lists in Keyphrase Extraction: Their Influence and Comparison

Approaches to improve preprocessing for Latent Dirichlet Allocation topic modeling

Threatening language detection from Urdu data with deep sequential model

Occupational groups prediction in Turkish Twitter data by using machine learning algorithms with multinomial approach

A Systematic Study on the Dilemma and Innovative Path of Rural Family Education Development in the Context of Deep Learning

Filtering Big Data with Optimized Hybrid Algorithm for IoT-Based Data Selection

LiHiSTO: a comprehensive list of Hindi stopwords

A deep CNN architecture with novel pooling layer applied to two Sudanese Arabic sentiment data sets

A study on corpus-based stopword lists in Indian language IR

Comparing research trends through author-provided keywords with machine extracted terms: A ML algorithm approach using publications data on neurological disorders

Civil aviation safety risk intelligent early warning model based on text mining and multi-model fusion

A NEW COMPUTATIONAL MODEL FOR TURKIC LANGUAGES MORPHOLOGY AND PROCESSING

Impact of Similarity Measures in Graph-based Automatic Text Summarization of Konkani Texts

Enhancing relevant concepts extraction for ontology learning using domain time relevance

SEA-PS: Semantic embedding with attention to measuring patent similarity by leveraging various text fields

Creation of a Russian Stop Word List

A Study on the Teaching Design of a Hybrid Civics Course Based on the Improved Attention Mechanism

DBTechVoc: A POS-tagged Vocabulary of Tokens and Lemmata of the Database Technical Domain

Technology theme mining of integrated circuit manufacturing industry chain based on patents

Creation of a Russian stop-word list

