Abstract

There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists derived from non-technical resources, the technical jargon of engineering fields contains its own highly frequent but uninformative words, and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, and uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and by curating a stopwords dataset ready for technical language processing applications.
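The statistical measures named in the abstract can be illustrated with a short sketch. The toy corpus, token counts, and thresholds below are purely hypothetical stand-ins, not the paper's data or exact formulas: a term that is frequent, appears in most documents (low inverse document frequency), and is spread evenly across them (high entropy) is a stopword candidate.

```python
import math
from collections import Counter

# Toy corpus standing in for engineering texts (illustrative only).
docs = [
    "the invention comprises a rotor assembly",
    "the method comprises heating the substrate",
    "a novel rotor design for wind turbines",
]

tokens_per_doc = [d.split() for d in docs]
N = len(docs)

# Corpus-wide term frequency and per-document document frequency.
tf = Counter(t for toks in tokens_per_doc for t in toks)
df = Counter(t for toks in tokens_per_doc for t in set(toks))
idf = {t: math.log(N / df[t]) for t in df}

def entropy(term):
    """Shannon entropy of a term's count distribution over documents.

    A near-uniform spread (high entropy) combined with high frequency
    and low IDF suggests the term carries little information.
    """
    counts = [toks.count(term) for toks in tokens_per_doc]
    total = sum(counts)
    probs = [c / total for c in counts if c]
    return -sum(p * math.log2(p) for p in probs)

for term in ("the", "comprises", "rotor"):
    print(term, tf[term], round(idf[term], 2), round(entropy(term), 2))
```

In practice these scores would be computed over a large patent or engineering corpus and combined (e.g. by ranking terms under each measure) before human review.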

Highlights

  • Natural language processing (NLP) and text analysis have become increasingly popular in engineering analytics [1,2,3,4,5,6]

  • There have been efforts to identify stopwords from generic knowledge sources such as the Brown Corpus [10, 12], the 20 Newsgroups corpus [8], a books corpus [13], etc., and to curate a generic stopwords list for removal in NLP applications across fields. The use of such a standard stopwords list, e.g. the one distributed with the popular Natural Language Tool Kit (NLTK) [14] Python package, for removal in data pre-processing has become an NLP standard in both research and industry

  • The dataset is further filtered by selecting the patents with at least one stopword from the NLTK+USPTO set and at least one stopword from the new list introduced in this study
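The filtering criterion in the last highlight can be sketched as follows. The two stopword sets and the patent snippets are hypothetical stand-ins for the NLTK+USPTO list and the new list introduced in the study; only the selection logic (at least one match from each set) reflects the text above.

```python
# Illustrative stand-ins for the two stopword sets named in the highlights.
NLTK_USPTO_STOPWORDS = {"the", "of", "said"}
NEW_ENGINEERING_STOPWORDS = {"wherein", "thereof"}

def keep(text):
    """Keep a patent only if it contains at least one stopword from each set."""
    words = set(text.lower().split())
    return bool(words & NLTK_USPTO_STOPWORDS) and bool(words & NEW_ENGINEERING_STOPWORDS)

patents = [
    "the rotor assembly wherein blades rotate",
    "rotor blades for turbines",
]
filtered = [p for p in patents if keep(p)]
print(filtered)  # only the first snippet satisfies both conditions
```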



Introduction

Natural language processing (NLP) and text analysis have become increasingly popular in engineering analytics [1,2,3,4,5,6]. There have been efforts to identify stopwords from generic knowledge sources such as the Brown Corpus [10, 12], the 20 Newsgroups corpus [8], a books corpus [13], etc., and to curate a generic stopwords list for removal in NLP applications across fields. The use of such a standard stopwords list, e.g. the one distributed with the popular Natural Language Tool Kit (NLTK) [14] Python package, for removal in data pre-processing has become an NLP standard in both research and industry. Researchers, analysts, and engineers working on technology-related textual data and technical language analysis can directly apply it to denoise and filter their technical textual data without conducting the manual and ad hoc discovery and removal of uninformative words themselves. We exemplify such a use case to measure the effectiveness of our new stopwords dataset in text classification tasks.
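A minimal sketch of the pre-processing step described above, assuming the usual stopword-removal workflow. The general-English set is a small excerpt of the kind of list shipped with NLTK, and the engineering set is a hypothetical example of the domain stopwords this study curates; neither is the actual published list.

```python
# Small excerpt of a general-English stopword list (NLTK ships a fuller one).
GENERAL_STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}
# Hypothetical engineering-domain stopwords, for illustration only.
ENGINEERING_STOPWORDS = {"method", "apparatus", "comprising", "wherein"}

def remove_stopwords(text, extra=frozenset()):
    """Lowercase, tokenize on whitespace, and drop stopwords."""
    stop = GENERAL_STOPWORDS | set(extra)
    return [w for w in text.lower().split() if w not in stop]

doc = "An apparatus comprising a rotor and a stator in the housing"
print(remove_stopwords(doc))                         # general list only
print(remove_stopwords(doc, ENGINEERING_STOPWORDS))  # general + domain list
```

The second call shows the point of a domain list: terms like "apparatus" that survive general-English filtering are removed once engineering stopwords are supplied, leaving only the informative tokens for a downstream classifier.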

Proposed approach
Pre-processing
Term statistics
Human evaluation
Final list
Case study evaluation
Text classification
Concluding remarks