Improving Semi-Supervised Differentiable Synthesizer Sound Matching for Practical Applications

Naotake Masuda,Daisuke Saito

doi:10.1109/taslp.2023.3237161

Abstract

While synthesizers have become commonplace in music production, many users find it difficult to control the parameters of a synthesizer to create a sound as they intended. In order to assist the user, the <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">sound matching</i> task aims to estimate synthesis parameters that produce a sound that is as close as possible to the query sound. Recently, neural networks have been employed for this task. These neural networks are trained on paired data of synthesis parameters and the corresponding output sound, optimizing a loss of synthesis parameters. However, query by the user usually consists of real-world sounds, different from the synthesizer output sounds used as training data. In a previous work, the authors presented a sound matching method where the synthesizer is implemented using differentiable DSP. The estimator network could then be trained by directly optimizing the spectral similarity between the original sound and the output sound. Furthermore, the network could be trained on real-world sounds whose ground-truth synthesis parameters are unavailable. This method was shown to improve the match quality in both objective and subjective measures. In this work, we experiment with different synthesizer configurations and extend this approach to a more practical synthesizer with effect modules and envelope generators. We propose a novel training strategy where the network is fully trained using both parameter loss and spectral loss. We show that models trained using this strategy is able to utilize the chorus effect effectively while models that switch completely to spectral loss underutilizes the chorus effect.

Full Text