Abstract

The availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning algorithm, where the unlabeled data can be different but must nevertheless have similar structure. First, a representation is learned from the unlabeled samples by decomposing their data matrix into two matrices, called the bases matrix and the activations matrix, respectively. This procedure is justified by the assumption that each sample is a linear combination of the columns of the bases matrix, which can be viewed as high-level features representing the knowledge learned from the unlabeled data in an unsupervised way. Next, activations of the labeled data are obtained using the bases, which are kept fixed. Finally, a classifier is built using these activations instead of the original labeled data. In this work, we investigated the performance of three popular methods for matrix decomposition: Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF) and Sparse Coding (SC), as unsupervised high-level feature extractors for the self-taught learning algorithm. We implemented this algorithm for the music genre classification task using two different databases: one as an unlabeled data pool and the other as data for supervised classifier training. The music pieces come from 10 and 6 genres for each database, respectively, with only one genre common to both. Results from a wide variety of experimental settings show that the self-taught learning method improves the classification rate when the amount of labeled data is small and, more interestingly, that consistent improvement can be achieved over a wide range of unlabeled data sizes. The best performance among the matrix decomposition approaches was shown by the Sparse Coding method.
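The three-step pipeline described above (learn bases from unlabeled data, compute activations of the labeled data with the bases fixed, then classify on the activations) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the random arrays stand in for real audio features, the dimensions and hyperparameters are arbitrary, and scikit-learn's `DictionaryLearning` is used as one possible sparse-coding variant.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Stand-ins for real audio feature vectors (hypothetical shapes for illustration).
X_unlabeled = rng.rand(200, 40)       # unlabeled pool: 200 samples, 40-dim features
X_labeled = rng.rand(30, 40)          # small labeled set
y_labeled = rng.randint(0, 2, 30)     # toy genre labels

# Step 1: learn a bases matrix (dictionary) from the unlabeled data via sparse coding.
dico = DictionaryLearning(n_components=16, alpha=1.0, max_iter=20, random_state=0)
dico.fit(X_unlabeled)                 # dico.components_ holds the 16 learned bases

# Step 2: with the bases fixed, compute sparse activations of the labeled data.
A_labeled = dico.transform(X_labeled)  # shape (30, 16)

# Step 3: train the classifier on activations instead of the raw labeled features.
clf = LogisticRegression(max_iter=1000).fit(A_labeled, y_labeled)
```

Swapping `DictionaryLearning` for `PCA` or `NMF` in step 1 (with `transform` in step 2) yields the other two decomposition variants compared in the paper.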

Highlights

  • A tremendous amount of music-related data has recently become available either locally or remotely over networks, and technology for searching this content and retrieving music-related information efficiently is demanded

  • This suggests that the input data are highly redundant and that the information captured by the eigenvectors is proportional to the data set size

  • In this study, we investigated the performance of several matrix decomposition methods, namely Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF) and Sparse Coding (SC), when applied for high-level feature extraction in the self-taught learning algorithm with respect to the music genre classification task


Summary

Introduction

A tremendous amount of music-related data has recently become available, either locally or remotely over networks, and technology for searching this content and retrieving music-related information efficiently is in demand. Recent music information retrieval research has increasingly adopted semi-supervised learning methods, where unlabeled data are utilized to help the classification task. A common assumption in this case is that both labeled and unlabeled data come from the same distribution [1], which may not hold in practice during data collection. Building on this framework and on semi-supervised learning ideas, the recently proposed self-taught learning algorithm [3] appears to be a good candidate for the kind of music genre classification task described above. According to this algorithm, a high-level representation of the unlabeled data is first found in an unsupervised manner. In order to make a fair comparison, we decided to use the original PCA and NMF rather than their sparse derivatives.

