Abstract

In low-resource domains, it is challenging to achieve good performance with existing machine learning methods due to a lack of training data and mixed data types (numeric and categorical). In particular, categorical variables with high cardinality pose a challenge to machine learning tasks such as classification and regression, because training requires sufficiently many data points for each possible value of each variable. Since interpolation is not possible, nothing can be learned for values not seen in the training set. This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data. We propose to address this challenge by using embeddings for categorical variables that are based on an explicit representation of domain knowledge (KR), namely a hierarchy of concepts. Our approach is to (1) define a semantic similarity measure between categories based on the hierarchy (we propose a purely hierarchy-based measure, but other similarity measures from the literature can be used) and (2) use that similarity measure to define a modified one-hot encoding. We propose two embedding schemes, for single-valued and multi-valued categorical data. We perform experiments on three use cases. First, we compare existing similarity approaches with our approach on a word-pair similarity task. Second, we create word embeddings using the different similarity approaches; a comparison with existing methods such as Google's Word2Vec and GloVe embeddings on several benchmarks shows better performance on concept categorisation tasks when knowledge-based embeddings are used. Third, we use a medical dataset to compare the performance of semantic-based embeddings and standard binary encodings; using semantic information yields a significant improvement in the performance of the downstream classification tasks.
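The two steps of the approach can be sketched in a few lines. The toy concept hierarchy, the Wu–Palmer-style path measure, and all names below are illustrative assumptions, not the paper's exact similarity measure:

```python
# Sketch of the two steps: (1) a hierarchy-based similarity between
# categories and (2) a "soft" one-hot encoding built from it.
# The toy hierarchy and the Wu-Palmer-style measure are assumptions
# for illustration, not the paper's exact measure.

# Child -> parent links of a small concept hierarchy.
PARENT = {
    "animal": None,
    "mammal": "animal", "bird": "animal",
    "dog": "mammal", "cat": "mammal", "eagle": "bird",
}

def ancestors(concept):
    """Path from a concept up to the root, including the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(lca) / (depth(a) + depth(b))."""
    path_a, path_b = ancestors(a), ancestors(b)
    lca = next(c for c in path_a if c in path_b)  # deepest common ancestor
    return 2 * len(ancestors(lca)) / (len(path_a) + len(path_b))

def semantic_encode(value, categories):
    """Modified one-hot: position i holds similarity(value, categories[i])
    instead of a hard 0/1 indicator."""
    return [similarity(value, c) for c in categories]

categories = ["dog", "cat", "eagle"]
vec = semantic_encode("dog", categories)
# vec[0] is 1.0 (exact match); vec[1] > vec[2] because "dog" shares a
# deeper common ancestor ("mammal") with "cat" than with "eagle" ("animal")
```

Unlike standard one-hot encoding, an unseen category that appears in the hierarchy still receives a non-zero, informative vector, which is what allows learning for values absent from the training set.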

Highlights

  • In machine learning, standard tasks such as classification and clustering perform well on datasets that contain a large number of training samples

  • We observe that semantic-based embedding schemes for categorical variables perform better than the traditional one-hot encoding of high-dimensional categorical variables in both settings: single-valued and multi-valued encodings

  • The highest recall score of 79% is achieved for single-valued embeddings using the Poly-hierarchy semantic similarity (PS) technique, which significantly outperforms one-hot encoding
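For the multi-valued setting mentioned above, one plausible encoding is to pool the single-valued semantic vectors of all present values element-wise. The toy hierarchy, the similarity measure, and the max-pooling choice are illustrative assumptions, not necessarily the paper's exact scheme:

```python
# A possible multi-valued extension of the semantic one-hot encoding:
# pool the per-value vectors with an element-wise max. The toy
# hierarchy and max-pooling are assumptions for illustration.

PARENT = {"animal": None, "mammal": "animal", "bird": "animal",
          "dog": "mammal", "cat": "mammal", "eagle": "bird"}

def path_to_root(concept):
    """Concept and all its ancestors, ordered from the concept to the root."""
    out = []
    while concept is not None:
        out.append(concept)
        concept = PARENT[concept]
    return out

def sim(a, b):
    # Wu-Palmer-style: 2 * depth(lca) / (depth(a) + depth(b))
    pa, pb = path_to_root(a), path_to_root(b)
    lca = next(x for x in pa if x in pb)
    return 2 * len(path_to_root(lca)) / (len(pa) + len(pb))

def encode_multi(values, categories):
    """Element-wise max over the per-value semantic one-hot vectors."""
    return [max(sim(v, c) for v in values) for c in categories]

categories = ["dog", "cat", "eagle"]
multi_vec = encode_multi({"dog", "eagle"}, categories)
# positions for "dog" and "eagle" reach 1.0; "cat" gets a partial score
```

Max-pooling keeps the vector length fixed at the number of categories regardless of how many values are present, so the downstream classifier's input dimensionality does not change.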

Introduction

Standard tasks such as classification and clustering perform well on datasets that contain a large number of training samples. In complex, low-resource domains, however, datasets have small sample sizes and mixed features (numeric and categorical), and the categorical features often have high cardinality. The combination of these factors makes it challenging to achieve good performance. Our goal is to improve machine learning performance in low-resource scenarios where high-quality structured knowledge, in the form of a hierarchy, is available and can be mapped to the categorical data in the domain.
