Experimental Comparison of Unsupervised Approaches in the Task of Separating Specializations Within Professions in Job Vacancies

Mikhail Vinel, Ivan Ryazanov, Dmitriy Botov, Ivan Nikolaev

doi:10.1007/978-3-030-34518-1_7

Abstract

This article presents an unsupervised approach to analyzing labor market requirements that addresses the problem of discovering latent specializations within broadly defined professions. For instance, for the profession of “programmer”, such specializations could be “CNC programmer”, “mobile developer”, “frontend developer”, and so on. Several statistical methods of text vector representation were experimentally evaluated: TF-IDF, probabilistic topic modeling, neural language models based on distributional semantics (word2vec, fastText), and deep contextualized word representations (ELMo and multilingual BERT). Both pre-trained models and models trained on Russian-language job vacancy texts were investigated. The experiments were conducted on a dataset provided by online recruitment platforms. Several clustering methods were tested: K-means, Affinity Propagation, BIRCH, agglomerative clustering, and HDBSCAN. When the number of clusters was predetermined (K-means, agglomerative clustering), the best result was achieved by the ARTM topic model. However, when the number of clusters was not specified in advance, word2vec trained on our job vacancy dataset outperformed the other models. Models trained on our own corpora performed much better than pre-trained models with large, even multilingual, vocabularies.
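
The general pipeline described above (vectorize vacancy texts, then cluster the vectors) can be illustrated with a minimal sketch. It is not the authors' implementation: it pairs the simplest representation mentioned (TF-IDF) with average-linkage agglomerative clustering for a predetermined number of clusters, uses only the Python standard library, and runs on six invented English toy "vacancies" rather than the Russian dataset used in the experiments. The function names, the toy texts, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy vacancy texts (invented for illustration, not from the paper's dataset).
VACANCIES = [
    "cnc programmer milling machine g-code lathe",
    "cnc machine operator programmer g-code turning",
    "frontend developer javascript react css html",
    "frontend developer html css javascript vue",
    "mobile developer android kotlin java",
    "mobile developer ios swift android",
]


def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors as {term: weight} dicts."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1.
        vec = {
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def agglomerative(vectors, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None  # (similarity, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(
                    cosine(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                sim = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for member in members:
            labels[member] = label
    return labels


labels = agglomerative(tfidf_vectors(VACANCIES), k=3)
for label, text in zip(labels, VACANCIES):
    print(label, text)
```

On this toy input the three invented specializations end up in separate clusters because each pair of texts shares several rare terms, while the shared word "developer" alone carries little weight. Swapping in another representation (for example, averaged word2vec vectors) would only require replacing `tfidf_vectors` with a function that returns comparable normalized vectors.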

Full Text