Abstract

In this article, we propose a new method of sound transformation based on control parameters that are intuitive and relevant for musicians. This method uses a variational autoencoder (VAE) model that is first trained in an unsupervised manner on a large dataset of synthesizer sounds. Then, a perceptual regularization term is added to the loss function to be optimized, and a supervised fine-tuning of the model is carried out using a small subset of perceptually labeled sounds. The labels were obtained from a perceptual test of Verbal Attribute Magnitude Estimation in which listeners rated the training sounds along eight perceptual dimensions (French equivalents of metallic, warm, breathy, vibrating, percussive, resonating, evolving, aggressive). These dimensions were identified as relevant for the description of synthesizer sounds in a preliminary Free Verbalization test. The resulting VAE model was evaluated by objective reconstruction measures and a perceptual test. Both showed that the model was able, to a certain extent, to capture the acoustic properties of most of the perceptual dimensions and to transform sound timbre along at least two of them (aggressive and vibrating) in a perceptually relevant manner. Moreover, it was able to generalize to unseen samples even though only a small set of labeled sounds was used.
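The two-phase procedure described above can be summarized as a single training objective. The following is a plausible written-out form, not the paper's exact formulation: the standard VAE objective plus a supervised penalty tying selected latent coordinates to the listener ratings; the weights α and β and the squared-error form of the penalty are assumptions.

\[
\mathcal{L}(\theta, \phi; x) =
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right]}_{\text{reconstruction}}
+ \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
+ \alpha \sum_{k=1}^{8} \left(z_k - r_k\right)^2
\]

where \(r_1, \dots, r_8\) are the perceptual ratings of \(x\) along the eight dimensions, and the last term is active only during the supervised fine-tuning phase.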

Highlights

  • Synthesizers are powerful instruments that offer musicians a large palette of possibilities for creating sounds

  • The objective of the present study is to propose a prototype of an audio synthesizer that can transform the timbre of synthetic musical sounds by controlling a limited number of perceptual dimensions

  • Regularized VAE implementation: considering the results reported by Roche et al. (2019) and other similar experiments conducted with the ARTURIA dataset, the experiment focused on a VAE model of the form [513, 128, enc, 128, 513] (a minimal sketch of such a model follows below)
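The following PyTorch sketch illustrates a VAE with the [513, 128, enc, 128, 513] layer sizes from the highlight above. The latent dimension, activation functions, and the exact form and weight of the perceptual regularization term are assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegularizedVAE(nn.Module):
    """Fully connected VAE with layer sizes [513, 128, latent, 128, 513]."""

    def __init__(self, input_dim=513, hidden_dim=128, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec_hidden = nn.Linear(latent_dim, hidden_dim)
        self.dec_out = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = torch.tanh(self.enc(x))
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps with eps ~ N(0, I).
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        return self.dec_out(torch.tanh(self.dec_hidden(z)))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def loss_fn(x_hat, x, mu, logvar, ratings=None, alpha=1.0):
    # Standard VAE objective: reconstruction error + KL divergence.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon + kl
    # Perceptual regularization (fine-tuning phase only): pull the first
    # latent coordinates toward the listener ratings. The squared-error
    # form and the weight alpha are assumptions.
    if ratings is not None:
        loss = loss + alpha * F.mse_loss(
            mu[:, : ratings.shape[1]], ratings, reduction="sum"
        )
    return loss
```

In this sketch, unsupervised pretraining calls `loss_fn` with `ratings=None`, and fine-tuning passes the batch of perceptual ratings so that each regularized latent dimension becomes an interpretable timbre control.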


Introduction

Synthesizers are powerful instruments that offer musicians a large palette of possibilities for creating sounds. The most common synthesis methods (additive synthesis, subtractive synthesis, frequency modulation, and physical modeling (Miranda, 2002)) are controlled by low-level parameters that are often numerous and not correlated with musical intent. To broaden the range of possible sounds and improve synthesizers' ergonomics, it might be better for musicians to control the sound synthesis from a reduced number of higher-level dimensions that are more intuitive and directly related to timbre perception. A first issue when searching for these control dimensions is that musical timbre is neither unidimensional nor uniparametric (von Bismarck, 1974). Controlling timbre with a synthesizer involves manipulating several perceptual dimensions, resulting in the joint variation of multiple acoustic descriptors.
