Survey on categorical data for neural networks

John T Hancock, Taghi M Khoshgoftaar

doi:10.1186/s40537-020-00305-w

Abstract

This survey investigates current techniques for representing qualitative data for use as input to neural networks. Techniques for using qualitative data in neural networks are well known. However, researchers continue to discover new variations or entirely new methods for working with categorical data in neural networks. Our primary contribution is to cover these representation techniques in a single work. Practitioners working with big data often have a need to encode categorical values in their datasets in order to leverage machine learning algorithms. Moreover, the size of data sets we consider as big data may cause one to reject some encoding techniques as impractical, due to their running time complexity. Neural networks take vectors of real numbers as inputs. One must use a technique to map qualitative values to numerical values before using them as input to a neural network. These techniques are known as embeddings, encodings, representations, or distributed representations. Another contribution this work makes is to provide references for the source code of various techniques, where we are able to verify the authenticity of the source code. We cover recent research in several domains where researchers use categorical data in neural networks. Some of these domains are natural language processing, fraud detection, and clinical document automation. This study provides a starting point for research in determining which techniques for preparing qualitative data for use with neural networks are best. It is our intention that the reader should use these implementations as a starting point to design experiments to evaluate various techniques for working with qualitative data in neural networks. The third contribution we make in this work is a new perspective on techniques for using categorical data in neural networks. We organize techniques for using categorical data in neural networks into three categories. We find three distinct patterns in techniques that identify a technique as determined, algorithmic, or automated. The fourth contribution we make is to identify several opportunities for future research. The form of the data that one uses as an input to a neural network is crucial for using neural networks effectively. This work is a tool for researchers to find the most effective technique for working with categorical data in neural networks, in big data settings. To the best of our knowledge this is the first in-depth look at techniques for working with categorical data in neural networks.

Highlights

There is a spectrum of techniques for using categorical variables in neural networks
Guo and Berkhahn’s results may lead one to believe that one should always use entity embedding to encode categorical variables
We present a new perspective on encoding techniques

Summary

Introduction

There is a spectrum of techniques for using categorical variables in neural networks. At one end of the spectrum are determined encoding techniques, which have low running time complexity. Between the two ends are algorithmic techniques. They may or may not have deterministic outcomes, but we place them in a class separate from determined techniques because they have higher running time complexity. At the other end of the spectrum are automatic techniques, where the neural network generates its own data representation as part of its training phase. Learning the representation during training is the key difference between automatic and algorithmic techniques.
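
The two ends of this spectrum can be illustrated with a minimal sketch. It assumes one-hot encoding as an example of a determined technique and a trainable lookup table as an example of an automatic one. The names `one_hot`, `Embedding`, `train`, and `squared_error`, the toy colour data, and the single linear layer standing in for a neural network are all hypothetical choices made for illustration, not code from the survey.

```python
import random


def one_hot(value, categories):
    """Determined encoding: a fixed rule maps each category to a unit vector."""
    return [1.0 if value == c else 0.0 for c in categories]


class Embedding:
    """Automatic encoding: one trainable vector per category (hypothetical helper)."""

    def __init__(self, categories, dim, seed=0):
        rng = random.Random(seed)
        self.index = {c: i for i, c in enumerate(categories)}
        # Small random initial vectors; training will move them.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in categories]

    def lookup(self, value):
        return self.table[self.index[value]]


def squared_error(embedding, weights, data):
    """Sum of squared errors of the linear model y ~ w . e(x)."""
    total = 0.0
    for value, target in data:
        pred = sum(w * v for w, v in zip(weights, embedding.lookup(value)))
        total += (pred - target) ** 2
    return total


def train(embedding, weights, data, lr=0.1, epochs=200):
    """Stochastic gradient descent that updates the weights AND the embedding rows."""
    for _ in range(epochs):
        for value, target in data:
            vec = embedding.lookup(value)  # reference to the table row
            pred = sum(w * v for w, v in zip(weights, vec))
            err = pred - target
            for k in range(len(vec)):
                grad_w = err * vec[k]
                grad_v = err * weights[k]
                weights[k] -= lr * grad_w
                vec[k] -= lr * grad_v  # the representation itself is learned
    return weights


categories = ["red", "green", "blue"]
print(one_hot("green", categories))  # [0.0, 1.0, 0.0]

# Toy regression targets, assumed purely for the demonstration.
data = [("red", 1.0), ("green", 0.0), ("blue", -1.0)]
emb = Embedding(categories, dim=2)
weights = [0.5, -0.5]
print("error before training:", squared_error(emb, weights, data))
train(emb, weights, data)
print("error after training:", squared_error(emb, weights, data))
print("learned vector for 'red':", emb.lookup("red"))
```

The one-hot vector is fully fixed by the category list, whereas the rows of `Embedding.table` change on every training step. An algorithmic technique would sit between the two: it would compute the vectors with a separate procedure before training rather than inside the training loop.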

Conclusion