Abstract

In recent years, Natural Language Processing (NLP) has made significant progress through general-purpose language embeddings, enabling breakthroughs in tasks such as semantic similarity and text classification. The task becomes more complex, however, with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. When such tasks are applied to domain-specific texts, such as those of the education and professional training domain, general language embedding models often fail to represent the specialized terminology and contextual nuances of the domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This framework chains multiple classifiers, each built using BERTEPro, a new sentence-embedding method based on existing Transformer models whose pre-training has been extended on education and professional training texts before fine-tuning on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates its local probability predictions, augmented with the input feature vectors, to the classifier in charge of the subsequent level. HMCCCProbT addresses issues of model scalability and semantic interpretation, offering an effective solution to the challenges of domain-specific hierarchical multi-label classification. Experiments on three domain-specific textual HMC datasets show that HMCCCProbT compares favorably with state-of-the-art HMC algorithms in classification accuracy, and that BERTEPro yields better probability predictions, well suited to HMCCCProbT, than three other vector representation techniques.
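The per-level chaining described above can be sketched as follows. This is a minimal illustration only, assuming scikit-learn-style multi-label classifiers over precomputed sentence embeddings; the class name LevelChain, the choice of one-vs-rest logistic regression as the base learner, and all method names are hypothetical and do not reflect the paper's actual implementation.

```python
# Hypothetical sketch of per-level probability chaining: one multi-label
# classifier per hierarchical level, each fed the original input vectors
# concatenated with the previous level's local probability predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


class LevelChain:
    def __init__(self, n_levels):
        # One independent multi-label classifier per hierarchical level.
        self.models = [OneVsRestClassifier(LogisticRegression(max_iter=1000))
                       for _ in range(n_levels)]

    def fit(self, X, Y_per_level):
        # X: input feature vectors (e.g. sentence embeddings), shape (n, d).
        # Y_per_level: one binary label-indicator matrix per level.
        features = X
        for model, Y in zip(self.models, Y_per_level):
            model.fit(features, Y)
            # Augment the original inputs with this level's probability
            # predictions before training the next level's classifier.
            probs = model.predict_proba(features)
            features = np.hstack([X, probs])
        return self

    def predict_proba(self, X):
        # Return the local probability predictions of every level,
        # propagating each level's output to the next, as in training.
        features, all_probs = X, []
        for model in self.models:
            probs = model.predict_proba(features)
            all_probs.append(probs)
            features = np.hstack([X, probs])
        return all_probs
```

In the method described by the abstract, the feature vectors fed to each level would be BERTEPro sentence embeddings rather than raw features, and the base learners need not be logistic regressions.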
