Abstract

Institutions that provide official statistics tend to use external data sources, such as administrative data, alongside regular statistical surveys. In addition to these sources, Big Data has become recognized as a new data source for providers of official statistics. Classification of textual data is one of the elementary tasks for providers of official statistics, regardless of the data source. This paper presents the application of two traditional machine learning algorithms, Multinomial Naive Bayes and Support Vector Machine, to the classification of advertised jobs according to ISCO-08. It describes the methods used to collect data on advertised jobs from four websites and the procedures for creating a multilingual dataset. Several types of text preprocessing are considered: converting uppercase letters to lowercase, stopword removal, punctuation mark removal, lemmatization, correction of commonly misspelled words, and reduction of replicated characters. We hypothesized that applying different combinations of preprocessing methods influences the text classification results. Two experiments were conducted to test the hypothesis. Both experiments showed that the Support Vector Machine algorithm gives better results than Multinomial Naive Bayes on the created dataset. The experiments also showed that the proposed algorithms performed well, with an overall accuracy of up to 90%, but with differing accuracy for individual classes due to the imbalanced dataset.
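As an illustration of how combinations of preprocessing steps can change the text that reaches a classifier, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the stopword list, the function names, and the rule of collapsing runs of three or more identical characters are assumptions made for the example. Lemmatization and correction of misspelled words are left out because they depend on external language resources.

```python
import re
import string

# Hypothetical, tiny English stopword list used only for this illustration.
STOPWORDS = {"a", "an", "the", "for", "of", "and", "in", "to", "we", "are"}

# Translation table that maps every ASCII punctuation mark to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def lowercase(text):
    """Convert uppercase letters to lowercase."""
    return text.lower()


def remove_punctuation(text):
    """Replace ASCII punctuation marks with spaces."""
    return text.translate(_PUNCT_TABLE)


def reduce_replicated(text):
    """Collapse runs of three or more identical characters to one (assumed rule)."""
    return re.sub(r"(.)\1{2,}", r"\1", text)


def remove_stopwords(text):
    """Drop tokens found in STOPWORDS (the comparison is case-sensitive)."""
    return " ".join(w for w in text.split() if w not in STOPWORDS)


# Step names are hypothetical labels for this sketch.
STEPS = {
    "lowercase": lowercase,
    "punctuation": remove_punctuation,
    "replicated": reduce_replicated,
    "stopwords": remove_stopwords,
}


def preprocess(text, steps):
    """Apply the named steps in the given order, then normalize whitespace."""
    for name in steps:
        text = STEPS[name](text)
    return " ".join(text.split())


if __name__ == "__main__":
    ad = "We are hiring a driver nowww"
    print(preprocess(ad, ["lowercase", "punctuation", "replicated", "stopwords"]))
    # Same steps, different order: stopword removal misses the capitalized "The".
    print(preprocess("The Driver", ["stopwords", "lowercase"]))
    print(preprocess("The Driver", ["lowercase", "stopwords"]))
```

Because each step is a plain function selected by name, different subsets and orderings can be passed to `preprocess` and the resulting texts fed to whichever classifier is being compared. The sketch only shows that the inputs differ; it says nothing about which combination performs best.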
