Abstract

Generating high-fidelity conditional audio samples and learning representations from unlabelled audio data are two challenging problems in machine learning research. Recent advances in Generative Adversarial Network (GAN) architectures show great promise in addressing these challenges. Learning powerful representations with a GAN architecture requires superior sample generation quality, which in turn requires an enormous amount of labelled data. In this paper, we address this issue by proposing the Guided Adversarial Autoencoder (GAAE), which can generate superior conditional audio samples from unlabelled audio data using a small percentage of labelled data as guidance. A representation learned from unlabelled data without any supervision carries no guarantee of being useful for a given downstream task. On the other hand, if representation learning is biased too strongly towards a particular downstream task, the model loses its generalisation capability, and the learned representation is of little use for other, unrelated tasks. The proposed GAAE model also addresses these issues. Using its superior conditional generation, GAAE can learn a representation specific to the downstream task. Furthermore, GAAE learns another representation that captures the general attributes of the data, independent of the downstream task at hand. Experimental results on the S09 and NSynth datasets attest to the superior performance of GAAE compared to state-of-the-art alternatives.
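
To make the guidance idea concrete, the following is a minimal sketch of a semi-supervised training step in the spirit of an adversarial autoencoder: every audio sample contributes an unsupervised reconstruction term, while only the small labelled subset adds a supervised term on the label-specific part of the latent code. The latent split, module names and loss weighting are illustrative assumptions rather than the exact GAAE architecture, and the adversarial regularisation of the latent prior is omitted for brevity.

    import torch
    import torch.nn.functional as F

    def guided_step(enc, dec, x, y=None):
        # enc is assumed to return a general-purpose code and label-specific logits
        z_gen, y_logits = enc(x)
        y_soft = F.softmax(y_logits, dim=1)
        # decode from the concatenated general + label-specific codes
        x_rec = dec(torch.cat([z_gen, y_soft], dim=1))
        loss = F.mse_loss(x_rec, x)                # unsupervised reconstruction on all data
        if y is not None:                          # guidance from the small labelled subset only
            loss = loss + F.cross_entropy(y_logits, y)
        return loss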

Highlights

  • Representation learning aims to map higher-dimensional data into a lower-dimensional representation space where the variational factors of the data are disentangled.

  • Impact of labelled data on conditional sample generation. 1) Setup: First, we evaluate the conditional sample generation quality of the Guided Adversarial Autoencoder (GAAE) model for different percentages of labelled data (1%-5%, 100%) used as guidance.

  • 2) Results and Discussion: The percentage of labelled training data used as guidance has a significant impact on the Inception Score (IS) and Fréchet Inception Distance (FID), as shown in Table 3; a minimal FID computation sketch follows this list.
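
Both metrics compare statistics of feature embeddings of real and generated samples. As a reference for how FID values such as those in Table 3 are typically computed, here is a minimal sketch over pre-extracted feature vectors; the feature extractor (an Inception-style classifier applied to the audio) is assumed to be given.

    import numpy as np
    from scipy.linalg import sqrtm

    def fid(real_feats, gen_feats):
        # real_feats, gen_feats: (n_samples, feat_dim) arrays of classifier features
        mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
        cov_r = np.cov(real_feats, rowvar=False)
        cov_g = np.cov(gen_feats, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g)             # matrix square root of the covariance product
        if np.iscomplexobj(covmean):               # drop tiny imaginary parts from numerical noise
            covmean = covmean.real
        return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))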


Summary

Introduction

Representation learning aims to map higher-dimensional data into a lower-dimensional representation space where the variational factors of the data are disentangled. In a Generative Adversarial Network (GAN), the Generator tries to fool the Discriminator by generating real-like samples from a random noise/latent distribution, and the Discriminator tries to defeat the Generator by differentiating the generated samples from the real samples [2]. During this game-play, the Generator disentangles the underlying attributes of the data in the given random latent distribution [3]. This helps in learning powerful representations [3]–[9] in an unsupervised manner. DeepMind [27] proposed a model that learns a useful representation from unlabelled speech data by predicting a future observation in the latent space. There are other successful applications [33]–[35] of self-supervised representation learning in the field of audio.
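
The adversarial game described above can be summarised in a short training step. The sketch below assumes PyTorch with pre-built Generator and Discriminator modules that return logits; the module names, latent dimensionality, and loss formulation are illustrative assumptions rather than details of any specific model in this paper.

    import torch
    import torch.nn.functional as F

    def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
        n = real.size(0)
        z = torch.randn(n, z_dim)
        ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

        # Discriminator: push real samples towards 1 and generated samples towards 0
        opt_D.zero_grad()
        d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
                  + F.binary_cross_entropy_with_logits(D(G(z).detach()), zeros))
        d_loss.backward()
        opt_D.step()

        # Generator: fool the Discriminator into predicting 1 for generated samples
        opt_G.zero_grad()
        g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
        g_loss.backward()
        opt_G.step()
        return d_loss.item(), g_loss.item()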

