Unsupervised classification of high-dimension and low-sample data with variational autoencoder based dimensionality reduction

Mohammad Sultan Mahmud, Xianghua Fu

doi:10.1109/icarm.2019.8834333

Abstract

In data mining research and development, one of the defining challenges is performing classification or clustering on data with relatively few samples and many dimensions, known as the high-dimensional, limited-sample-size (HDLSS) problem. Because the sample size is limited, there is not enough training data to train classification models. In addition, the 'curse of dimensionality' often restricts the effectiveness of many methods for solving the HDLSS problem. A classification model trained on a limited-sample dataset tends to overfit and cannot achieve satisfactory results. Thus, an unsupervised method is a better choice for such problems. Given the emergence of deep learning, its many applications, and its promising outcomes, an extensive analysis of deep learning techniques on HDLSS datasets is needed. This paper evaluates the performance of variational autoencoder (VAE) based dimensionality reduction and unsupervised classification on HDLSS datasets. The performance of VAE is compared with two existing techniques, principal component analysis (PCA) and non-negative matrix factorization (NMF), on fourteen datasets in terms of three evaluation metrics: purity, Rand index, and normalized mutual information (NMI). The experimental results show the superiority of VAE over the traditional methods on HDLSS datasets.
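
The three evaluation metrics named above (purity, Rand index, and NMI) can be illustrated with a minimal sketch in plain Python. This is not the authors' evaluation code. The function names are hypothetical, and the NMI normalization shown here (the arithmetic mean of the two label entropies) is an assumption, since the abstract does not state which variant was used.

```python
from collections import Counter
from itertools import combinations
from math import log


def purity(labels_true, labels_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies
    (assumed normalization; other variants use the geometric mean, min, or max)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = sum(
        (nij / n) * log(n * nij / (count_true[t] * count_pred[p]))
        for (t, p), nij in joint.items()
    )
    mean_h = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Both labelings are constant: treat them as trivially identical.
    return mi / mean_h if mean_h > 0 else 1.0
```

All three functions take the ground-truth labels and the cluster assignments as equal-length sequences. Purity and NMI lie in [0, 1] and equal 1 when the clustering matches the classes up to a relabeling. The Rand index also lies in [0, 1] but is not corrected for chance agreement.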

Full Text