Abstract

Although natural language processing (NLP) develops algorithms and computational models that enable machines to understand, interpret, and generate human language, machines still cannot fully grasp the meanings behind words. Specifically, without predefined standards or baselines, they cannot help humans categorize words as general-purpose or technical. Empirically, prior research has relied on inefficient manual work to exclude general-purpose words when extracting technical words (i.e., terminology or terms used within a specific field or domain of expertise) to obtain domain information from a target corpus. Therefore, to enhance the efficiency of extracting domain-oriented technical words in corpus analysis, this paper proposes a machine-based corpus optimization method that compiles an advanced general-purpose word list (AGWL) to serve as the exclusion baseline for the machine to extract domain-oriented technical words. To validate the proposed method, this paper uses 52 COVID-19 research articles as the target corpus in an empirical example. Compared with traditional methods, the proposed method offers three significant contributions. (1) It automatically eliminates the most common function words in corpus data. (2) Through a machine-driven process, it removes general-purpose words with high frequency and dispersion rates: 57% of word types are general-purpose words, and they account for 90% of the total words in the target corpus. The remaining 43% of word types, which represent domain-oriented technical words and make up 10% of the total words in the target corpus, can then be extracted. Future researchers can therefore focus exclusively on these 43% of word types in the optimized word list (OWL), which improves the efficiency of corpus analysis for extracting domain knowledge. (3) It establishes a standard operating procedure (SOP) that can be replicated and applied generally to optimize any corpus data.
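
The exclusion step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how the AGWL is compiled, so the sketch assumes a hypothetical reference corpus, measures dispersion as the share of documents containing a word, and uses arbitrary illustrative thresholds. The function names (`tokenize`, `build_exclusion_list`, `optimize_word_list`) are likewise invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_exclusion_list(reference_docs, min_freq=3, min_dispersion=0.5):
    """Return words that are both frequent and widely dispersed in reference_docs.

    Assumptions (not from the paper): dispersion is the fraction of documents
    that contain the word, and both thresholds are illustrative values.
    """
    freq = Counter()       # total occurrences of each word
    doc_count = Counter()  # number of documents containing each word
    for doc in reference_docs:
        tokens = tokenize(doc)
        freq.update(tokens)
        doc_count.update(set(tokens))
    n_docs = len(reference_docs)
    return {
        word
        for word, count in freq.items()
        if count >= min_freq and doc_count[word] / n_docs >= min_dispersion
    }


def optimize_word_list(target_docs, exclusion_list):
    """Count words in the target corpus and drop those on the exclusion list."""
    freq = Counter()
    for doc in target_docs:
        freq.update(tokenize(doc))
    return Counter({w: c for w, c in freq.items() if w not in exclusion_list})
```

Under these assumptions, calling `build_exclusion_list` on a reference corpus and passing the result to `optimize_word_list` together with the target documents yields a frequency list restricted to the words that survive exclusion, which plays the role of the OWL in this sketch.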
