Vocational Domain Identification with Machine Learning and Natural Language Processing on Wikipedia Text: Error Analysis and Class Balancing

Maria Nefeli Nikiforos, Katia Lida Kermanidis, Adamantia Pateli, Konstantina Deliveri

doi:10.3390/computers12060111

Abstract

Highly skilled migrants and refugees often find employment in low-skill vocations despite their professional qualifications and educational backgrounds; this has become a global tendency, mainly due to the language barrier. Employment prospects for displaced communities are largely determined by their knowledge of the sublanguage of the vocational domain in which they wish to work. Common vocational domains include agriculture, cooking, crafting, construction, and hospitality. The increasing amount of user-generated content in wikis and social networks provides a valuable source of data for data mining, natural language processing, and machine learning applications. This paper extends the authors’ previous research on automatic vocational domain identification by further analyzing the results of machine learning experiments with a domain-specific textual data set along two research directions: (a) prediction analysis and (b) data balancing. Analysis of wrong predictions and the features that contributed to misclassification, along with analysis of correct predictions and their most dominant features, led to the identification of a primary set of terms for the vocational domains. Data balancing techniques were applied to the data set to observe their impact on the performance of the classification model. A novel four-step methodology is proposed, which consists of successive applications of SMOTE oversampling on imbalanced data. Data oversampling obtained better results than data undersampling on imbalanced data sets, while hybrid approaches performed reasonably well.
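
The abstract does not spell out the four steps of the proposed methodology, so the following is only a minimal sketch of the general idea of applying SMOTE-style oversampling successively to an imbalanced, multi-class data set. It is not the authors' implementation. The function names `smote_oversample` and `balance_successively`, the choice to process the smallest class first, and the numeric feature vectors are all assumptions made for illustration; in practice the features would come from a text representation such as TF-IDF.

```python
import random
from collections import Counter


def smote_oversample(X, y, minority_label, n_new, k=2, seed=0):
    """Create n_new synthetic samples for one class, SMOTE-style.

    Each synthetic point lies on the line segment between a randomly chosen
    sample of the class and one of its k nearest neighbours in the same class.
    (Hypothetical helper, written for illustration only.)
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two samples of the class")
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other samples of the class by squared Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance_successively(X, y, k=2, seed=0):
    """Oversample one under-represented class at a time (smallest first)
    until every class has as many samples as the largest class.

    The ordering is an assumption; the abstract only states that SMOTE
    is applied successively.
    """
    X, y = [list(x) for x in X], list(y)
    counts = Counter(y)
    target = max(counts.values())
    for label, count in sorted(counts.items(), key=lambda item: item[1]):
        if count < target:
            new_points = smote_oversample(X, y, label, target - count, k, seed)
            X.extend(new_points)
            y.extend([label] * len(new_points))
    return X, y
```

Calling `balance_successively(X, y)` on a list of feature vectors `X` and labels `y` returns an enlarged data set in which all classes have equal size. Only the training split should be oversampled, so that synthetic points do not leak into evaluation.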


Journal: Computers
Publication Date: May 24, 2023
Citations: 1
License type: CC BY 4.0

Similar Papers

Machine Learning on Wikipedia Text for the Automatic Identification of Vocational Domains of Significance for Displaced Communities
Maria Nefeli Nikiforos ... Katia Lida Kermanidis
03 Nov 2022

Feature selection for classification using WGCNA and Spread Sub-Sample for an imbalanced rheumatoid arthritis RNASEQ data
Consolata Gakii ... Boaz Too
Informatics in Medicine Unlocked | VOL. 43
01 Jan 2023

Natural Language Processing Basics.
Naveen Arivazhagan ... Tielman T Van Vleck
Clinical Journal of the American Society of Nephrology | VOL. 18
08 Feb 2023

Combining concept maps and interviews to produce representations of personal professional theories in higher vocational education: effects of order and vocational domain
Antoine C M Van Den Bogaart ... Harmen Schaap
Instructional Science | VOL. 45
10 Feb 2017
