Sparsity-based representation for categorical data

K Srindhya,Remya Menon,M R Kaimal,Shruthi S Nair

doi:10.1109/raics.2013.6745450

Abstract

Over the past few decades, many algorithms have been continuously evolving in the area of machine learning. This is an era of big data which is generated by different applications related to various fields like medicine, the World Wide Web, E-learning networking etc. So, we are still in need for more efficient algorithms which are computationally cost effective and thereby producing faster results. Sparse representation of data is one giant leap toward the search for a solution for big data analysis. The focus of our paper is on algorithms for sparsity-based representation of categorical data. For this, we adopt a concept from the image and signal processing domain called dictionary learning. We have successfully implemented its sparse coding stage which gives the sparse representation of data using Orthogonal Matching Pursuit (OMP) algorithms (both Batch and Cholesky based) and its dictionary update stage using the Singular Value Decomposition (SVD). We have also used a preprocessing stage where we represent the categorical dataset using a vector space model based on the TF-IDF weighting scheme. Our paper demonstrates how input data can be decomposed and approximated as a linear combination of minimum number of elementary columns of a dictionary which so formed will be a compact representation of data. Classification or clustering algorithms can now be easily performed based on the generated sparse coded coefficient matrix or based on the dictionary. We also give a comparison of the dictionary learning algorithm when applying different OMP algorithms. The algorithms are analysed and results are demonstrated by synthetic tests and on real data.

Full Text