Abstract

There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists derived from non-technical resources, the technical jargon of engineering fields contains its own highly frequent but uninformative words, and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, and uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and by curating a stopwords dataset ready for technical language processing applications.
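The statistical measures named in the abstract can be illustrated with a short sketch. The toy corpus, token counts, and thresholds below are purely hypothetical stand-ins, not the paper's data or exact formulas: a term that is frequent, appears in most documents (low inverse document frequency), and is spread evenly across them (high entropy) is a stopword candidate.

```python
import math
from collections import Counter

# Toy corpus standing in for engineering texts (illustrative only).
docs = [
    "the invention comprises a rotor assembly",
    "the method comprises heating the substrate",
    "a novel rotor design for wind turbines",
]

tokens_per_doc = [d.split() for d in docs]
N = len(docs)

# Corpus-wide term frequency and per-document document frequency.
tf = Counter(t for toks in tokens_per_doc for t in toks)
df = Counter(t for toks in tokens_per_doc for t in set(toks))
idf = {t: math.log(N / df[t]) for t in df}

def entropy(term):
    """Shannon entropy of a term's count distribution over documents.

    A near-uniform spread (high entropy) combined with high frequency
    and low IDF suggests the term carries little information.
    """
    counts = [toks.count(term) for toks in tokens_per_doc]
    total = sum(counts)
    probs = [c / total for c in counts if c]
    return -sum(p * math.log2(p) for p in probs)

for term in ("the", "comprises", "rotor"):
    print(term, tf[term], round(idf[term], 2), round(entropy(term), 2))
```

In practice these scores would be computed over a large patent or engineering corpus and combined (e.g. by ranking terms under each measure) before human review.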

Highlights

  • Natural language processing (NLP) and text analysis have become increasingly popular in engineering analytics [1,2,3,4,5,6]

  • There have been efforts to identify stopwords from generic knowledge sources such as the Brown Corpus [10, 12], the 20 Newsgroups corpus [8], a books corpus [13], etc., and to curate a generic stopwords list for removal in NLP applications across fields. The use of such a standard stopwords list, e.g. the one distributed with the popular Natural Language Tool Kit (NLTK) [14] Python package, for removal in data pre-processing has become an NLP standard in both research and industry

  • The dataset is further filtered by selecting the patents with at least one stopword from the NLTK+USPTO set and at least one stopword from the new list introduced in this study
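The filtering criterion in the last highlight can be sketched as follows. The two stopword sets and the patent snippets are hypothetical stand-ins for the NLTK+USPTO list and the new list introduced in the study; only the selection logic (at least one match from each set) reflects the text above.

```python
# Illustrative stand-ins for the two stopword sets named in the highlights.
NLTK_USPTO_STOPWORDS = {"the", "of", "said"}
NEW_ENGINEERING_STOPWORDS = {"wherein", "thereof"}

def keep(text):
    """Keep a patent only if it contains at least one stopword from each set."""
    words = set(text.lower().split())
    return bool(words & NLTK_USPTO_STOPWORDS) and bool(words & NEW_ENGINEERING_STOPWORDS)

patents = [
    "the rotor assembly wherein blades rotate",
    "rotor blades for turbines",
]
filtered = [p for p in patents if keep(p)]
print(filtered)  # only the first snippet satisfies both conditions
```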



Introduction

Natural language processing (NLP) and text analysis have become increasingly popular in engineering analytics [1,2,3,4,5,6]. There have been efforts to identify stopwords from generic knowledge sources such as the Brown Corpus [10, 12], the 20 Newsgroups corpus [8], a books corpus [13], etc., and to curate a generic stopwords list for removal in NLP applications across fields. The use of such a standard stopwords list, e.g. the one distributed with the popular Natural Language Tool Kit (NLTK) [14] Python package, for removal in data pre-processing has become an NLP standard in both research and industry. Researchers, analysts, and engineers working on technology-related textual data and technical language analysis can directly apply it to denoise and filter their technical textual data without conducting the manual and ad hoc discovery and removal of uninformative words themselves. We exemplify such a use case to measure the effectiveness of our new stopwords dataset in text classification tasks.
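A minimal sketch of the pre-processing step described above, assuming the usual stopword-removal workflow. The general-English set is a small excerpt of the kind of list shipped with NLTK, and the engineering set is a hypothetical example of the domain stopwords this study curates; neither is the actual published list.

```python
# Small excerpt of a general-English stopword list (NLTK ships a fuller one).
GENERAL_STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}
# Hypothetical engineering-domain stopwords, for illustration only.
ENGINEERING_STOPWORDS = {"method", "apparatus", "comprising", "wherein"}

def remove_stopwords(text, extra=frozenset()):
    """Lowercase, tokenize on whitespace, and drop stopwords."""
    stop = GENERAL_STOPWORDS | set(extra)
    return [w for w in text.lower().split() if w not in stop]

doc = "An apparatus comprising a rotor and a stator in the housing"
print(remove_stopwords(doc))                         # general list only
print(remove_stopwords(doc, ENGINEERING_STOPWORDS))  # general + domain list
```

The second call shows the point of a domain list: terms like "apparatus" that survive general-English filtering are removed once engineering stopwords are supplied, leaving only the informative tokens for a downstream classifier.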

Proposed approach
Pre-processing
Term statistics
Human evaluation
Final list
Case study evaluation
Text classification
Concluding remarks