Abstract

Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic Speaker Verification (ASV) provides us with architectures and evaluation methodologies for verifying the identities of unseen speakers, and these can be repurposed for the task of learning and evaluating a musical instrument sound embedding space that can support unseen instruments. Borrowing from state-of-the-art ASV techniques, we construct a musical instrument recognition model that uses a SincNet front-end, a ResNet architecture, and an angular softmax objective function. Experiments on the NSynth and RWC datasets show our model’s effectiveness in terms of equal error rate (EER) for unseen instruments, and ablation studies show the importance of data augmentation and the angular softmax objective. Experiments also show the benefit of using a CQT-based filterbank for initializing SincNet over a Mel filterbank initialization. Further complementary analysis of the learned embedding space is conducted with t-SNE visualizations and probing classification tasks, which show that including instrument family labels as a multi-task learning target can help to regularize the embedding space and incorporate useful structure, and that meaningful information such as playing style, which was not included during training, is contained in the embeddings of unseen instruments.
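As a concrete illustration of the filterbank comparison mentioned above, the sketch below contrasts Mel-spaced and constant-Q (geometrically spaced) center frequencies of the kind that could be used to initialize SincNet's learnable band-pass filters. This is a minimal sketch, not the paper's code; the filter count, frequency range, and bins-per-octave values are illustrative assumptions.

```python
# Illustrative initialization schemes for SincNet band-pass center frequencies
# (not the authors' implementation; all constants are placeholders).
import numpy as np

def mel_init(n_filters=80, f_min=30.0, f_max=8000.0):
    """Center frequencies spaced uniformly on the Mel scale (a common SincNet default)."""
    mel_lo = 2595.0 * np.log10(1.0 + f_min / 700.0)
    mel_hi = 2595.0 * np.log10(1.0 + f_max / 700.0)
    mel = np.linspace(mel_lo, mel_hi, n_filters)
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def cqt_init(n_filters=80, f_min=32.7, bins_per_octave=12):
    """Center frequencies spaced geometrically, as in a constant-Q transform,
    so adjacent bands keep a constant frequency ratio."""
    return f_min * 2.0 ** (np.arange(n_filters) / bins_per_octave)

print(mel_init()[:4])  # Mel spacing: near-linear at low frequencies
print(cqt_init()[:4])  # CQT spacing: constant ratio between adjacent bands
```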

Highlights

  • Multi-instrument audio synthesis, including timbre-style transfer, is an actively researched audio generation task in which we disentangle instrument timbre and music content, control and manipulate the timbre, and generate high-fidelity, natural-sounding audio signals.

  • How can we perform such evaluation and analysis of instrument embedding vectors obtained from audio of unseen instruments? In this paper, we propose to adopt evaluation frameworks used in the Automatic Speaker Verification (ASV) field to answer this scientific question.

  • Using a trained instrument encoder, the first set is used for extracting embedding vectors for enrollment, and the second set is used for measuring similarity to the embedding vector of the same unseen instrument, and dissimilarity to embedding vectors of other unseen instruments included in the test set (a minimal scoring sketch follows below).
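The enrollment/test protocol in the last highlight can be made concrete with a small scoring sketch: embeddings extracted from enrollment recordings are compared to embeddings of test recordings by cosine similarity, and the equal error rate (EER) is computed over target (same unseen instrument) and non-target (different unseen instrument) trials. The code below is a minimal illustration with placeholder arrays, not the paper's evaluation pipeline.

```python
# Hedged sketch of verification-trial scoring and EER computation.
# `labels` marks each trial as target (1, same instrument) or non-target (0).
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between an enrollment embedding and a test embedding."""
    num = np.dot(enroll_emb, test_emb)
    den = np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb) + 1e-8
    return float(num / den)

def equal_error_rate(scores, labels):
    """EER: the operating point where false-acceptance and false-rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage: random vectors stand in for encoder outputs of unseen instruments.
rng = np.random.default_rng(0)
enroll = {"violin": rng.normal(size=128), "flute": rng.normal(size=128)}
trials = [("violin", rng.normal(size=128), 1), ("flute", rng.normal(size=128), 0)]
scores = [cosine_score(enroll[name], emb) for name, emb, _ in trials]
labels = [lab for _, _, lab in trials]
print(equal_error_rate(scores, labels))
```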

Summary

INTRODUCTION

Multi-instrument audio synthesis, including timbre-style transfer, is an actively researched audio generation task in which we disentangle instrument timbre and music content, control and manipulate the timbre, and generate high-fidelity, natural-sounding audio signals. A database for training the instrument encoder typically consists of monophonic sounds from many different types of instruments, and thanks to the similarities between the tasks and audio signals, we can utilize speaker recognition models directly. We introduce SincNet [16] for feature extraction from raw waveforms, ResNet [17] as the main body, learnable dictionary encoding (LDE) [18] for aggregation of audio signals of varying lengths, and angular softmax [19] for class-discriminative training. All of these have been reported to perform well on multiple ASV benchmark datasets [16], [19]–[21], and they are expected to make our evaluation and analysis of the instrument embedding vectors obtained from unseen instruments meaningful.
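To make the less standard parts of this pipeline concrete, the sketch below (assuming PyTorch; not the authors' implementation) shows an LDE-style pooling layer that aggregates a variable-length sequence of frame-level features into a fixed-size vector, and an additive-margin softmax head standing in for the angular softmax family of objectives. The SincNet front-end and ResNet trunk are omitted because standard implementations exist; all dimensions and hyperparameters are placeholders.

```python
# Hedged sketch of LDE pooling and an angular-margin (additive-margin) softmax loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDEPooling(nn.Module):
    """Softly assign frame features to learnable dictionary components and
    aggregate the residuals, giving one fixed-size vector per recording."""
    def __init__(self, feat_dim, n_components=32):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(n_components, feat_dim))
        self.scale = nn.Parameter(torch.ones(n_components))

    def forward(self, x):                                 # x: (batch, time, feat_dim)
        resid = x.unsqueeze(2) - self.dictionary          # (B, T, C, D) residuals
        weight = F.softmax(-self.scale * resid.pow(2).sum(-1), dim=2)  # (B, T, C)
        agg = (weight.unsqueeze(-1) * resid).sum(dim=1)   # (B, C, D) aggregate over time
        return agg.flatten(1)                             # (B, C * D)

class MarginSoftmaxLoss(nn.Module):
    """Classification loss over length-normalized embeddings with an additive
    cosine margin on the target class."""
    def __init__(self, emb_dim, n_classes, scale=30.0, margin=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = scale, margin

    def forward(self, emb, target):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))  # (B, n_classes)
        onehot = F.one_hot(target, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)          # margin applied to target class only
        return F.cross_entropy(logits, target)

# Toy usage with placeholder shapes.
pool = LDEPooling(feat_dim=256, n_components=8)
loss_fn = MarginSoftmaxLoss(emb_dim=256 * 8, n_classes=100)
frames = torch.randn(4, 120, 256)                          # 4 clips, 120 frames each
loss = loss_fn(pool(frames), torch.randint(0, 100, (4,)))
print(loss.item())
```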

Multi-Instrument Audio Synthesis
Musical Instrument Recognition and Relevant Topics
MUSICAL INSTRUMENT RECOGNITION MODEL INSPIRED BY SPEAKER RECOGNITION
Statistical Significance Analysis
Front-End
Encoding and Temporal Aggregation
Objective Function and Output Layers
ASV BENCHMARKING TECHNIQUES FOR VERIFICATION OF UNSEEN MUSICAL INSTRUMENTS
Experimental Conditions
Experimental Results
Motivation
Details of Shallow Classifiers
Metadata Labels Used for Probing Tasks
CONCLUSION