Abstract

Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic Speaker Verification (ASV) provides us with architectures and evaluation methodologies for verifying the identities of unseen speakers, and these can be repurposed for the task of learning and evaluating a musical instrument sound embedding space that can support unseen instruments. Borrowing from state-of-the-art ASV techniques, we construct a musical instrument recognition model that uses a SincNet front-end, a ResNet architecture, and an angular softmax objective function. Experiments on the NSynth and RWC datasets show our model’s effectiveness in terms of equal error rate (EER) for unseen instruments, and ablation studies show the importance of data augmentation and the angular softmax objective. Experiments also show the benefit of using a CQT-based filterbank for initializing SincNet over a Mel filterbank initialization. Further complementary analysis of the learned embedding space is conducted with t-SNE visualizations and probing classification tasks, which show that including instrument family labels as a multi-task learning target can help to regularize the embedding space and incorporate useful structure, and that meaningful information such as playing style, which was not included during training, is contained in the embeddings of unseen instruments.
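As a concrete illustration of the filterbank comparison mentioned above, the sketch below contrasts Mel-spaced and constant-Q (geometrically spaced) center frequencies of the kind that could be used to initialize SincNet's learnable band-pass filters. This is a minimal sketch, not the paper's code; the filter count, frequency range, and bins-per-octave values are illustrative assumptions.

```python
# Illustrative initialization schemes for SincNet band-pass center frequencies
# (not the authors' implementation; all constants are placeholders).
import numpy as np

def mel_init(n_filters=80, f_min=30.0, f_max=8000.0):
    """Center frequencies spaced uniformly on the Mel scale (a common SincNet default)."""
    mel_lo = 2595.0 * np.log10(1.0 + f_min / 700.0)
    mel_hi = 2595.0 * np.log10(1.0 + f_max / 700.0)
    mel = np.linspace(mel_lo, mel_hi, n_filters)
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def cqt_init(n_filters=80, f_min=32.7, bins_per_octave=12):
    """Center frequencies spaced geometrically, as in a constant-Q transform,
    so adjacent bands keep a constant frequency ratio."""
    return f_min * 2.0 ** (np.arange(n_filters) / bins_per_octave)

print(mel_init()[:4])  # Mel spacing: near-linear at low frequencies
print(cqt_init()[:4])  # CQT spacing: constant ratio between adjacent bands
```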

Highlights

  • Multi-instrument audio synthesis, including timbre-style transfer, is an actively researched audio generation task in which we disentangle instrument timbre and music content, control and manipulate the timbre, and generate high-fidelity, natural-sounding audio signals.

  • How can we perform such evaluation and analysis of instrument embedding vectors obtained from audio of unseen instruments? In this paper, we propose to adopt evaluation frameworks used in the Automatic Speaker Verification (ASV) field to answer this scientific question.

  • Using a trained instrument encoder, the first set is used for extracting embedding vectors for enrollment, and the second set is used for measuring similarity to the embedding vector of the same unseen instrument, and dissimilarity to embedding vectors of other unseen instruments included in the test set (a minimal scoring sketch follows below).
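The enrollment/test protocol in the last highlight can be made concrete with a small scoring sketch: embeddings extracted from enrollment recordings are compared to embeddings of test recordings by cosine similarity, and the equal error rate (EER) is computed over target (same unseen instrument) and non-target (different unseen instrument) trials. The code below is a minimal illustration with placeholder arrays, not the paper's evaluation pipeline.

```python
# Hedged sketch of verification-trial scoring and EER computation.
# `labels` marks each trial as target (1, same instrument) or non-target (0).
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between an enrollment embedding and a test embedding."""
    num = np.dot(enroll_emb, test_emb)
    den = np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb) + 1e-8
    return float(num / den)

def equal_error_rate(scores, labels):
    """EER: the operating point where false-acceptance and false-rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage: random vectors stand in for encoder outputs of unseen instruments.
rng = np.random.default_rng(0)
enroll = {"violin": rng.normal(size=128), "flute": rng.normal(size=128)}
trials = [("violin", rng.normal(size=128), 1), ("flute", rng.normal(size=128), 0)]
scores = [cosine_score(enroll[name], emb) for name, emb, _ in trials]
labels = [lab for _, _, lab in trials]
print(equal_error_rate(scores, labels))
```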

Summary

INTRODUCTION

Multi-instrument audio synthesis, including timbre-style transfer, is an actively researched audio generation task in which we disentangle instrument timbre and music content, control and manipulate the timbre, and generate high-fidelity, natural-sounding audio signals. A database for training the instrument encoder typically consists of monophonic sounds from many different types of instruments, and thanks to the similarities between the tasks and audio signals, we can utilize speaker recognition models directly. We introduce SincNet [16] for feature extraction from raw waveforms, ResNet [17] as the main body, learnable dictionary encoding (LDE) [18] for aggregation of audio signals of varying lengths, and angular softmax [19] for class-discriminative training. All of these have been reported to perform well on multiple ASV benchmark datasets [16], [19]–[21], and they are expected to make our evaluation and analysis of the instrument embedding vectors obtained from unseen instruments meaningful.
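To make the less standard parts of this pipeline concrete, the sketch below (assuming PyTorch; not the authors' implementation) shows an LDE-style pooling layer that aggregates a variable-length sequence of frame-level features into a fixed-size vector, and an additive-margin softmax head standing in for the angular softmax family of objectives. The SincNet front-end and ResNet trunk are omitted because standard implementations exist; all dimensions and hyperparameters are placeholders.

```python
# Hedged sketch of LDE pooling and an angular-margin (additive-margin) softmax loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDEPooling(nn.Module):
    """Softly assign frame features to learnable dictionary components and
    aggregate the residuals, giving one fixed-size vector per recording."""
    def __init__(self, feat_dim, n_components=32):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(n_components, feat_dim))
        self.scale = nn.Parameter(torch.ones(n_components))

    def forward(self, x):                                 # x: (batch, time, feat_dim)
        resid = x.unsqueeze(2) - self.dictionary          # (B, T, C, D) residuals
        weight = F.softmax(-self.scale * resid.pow(2).sum(-1), dim=2)  # (B, T, C)
        agg = (weight.unsqueeze(-1) * resid).sum(dim=1)   # (B, C, D) aggregate over time
        return agg.flatten(1)                             # (B, C * D)

class MarginSoftmaxLoss(nn.Module):
    """Classification loss over length-normalized embeddings with an additive
    cosine margin on the target class."""
    def __init__(self, emb_dim, n_classes, scale=30.0, margin=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = scale, margin

    def forward(self, emb, target):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))  # (B, n_classes)
        onehot = F.one_hot(target, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)          # margin applied to target class only
        return F.cross_entropy(logits, target)

# Toy usage with placeholder shapes.
pool = LDEPooling(feat_dim=256, n_components=8)
loss_fn = MarginSoftmaxLoss(emb_dim=256 * 8, n_classes=100)
frames = torch.randn(4, 120, 256)                          # 4 clips, 120 frames each
loss = loss_fn(pool(frames), torch.randint(0, 100, (4,)))
print(loss.item())
```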

Multi-Instrument Audio Synthesis
Musical Instrument Recognition and Relevant Topics
MUSICAL INSTRUMENT RECOGNITION MODEL INSPIRED BY SPEAKER RECOGNITION
Statistical Significance Analysis
Front-End
Encoding and Temporal Aggregation
Objective Function and Output Layers
ASV BENCHMARKING TECHNIQUES FOR VERIFICATION OF UNSEEN MUSICAL INSTRUMENTS
Experimental Conditions
Experimental Results
Motivation
Details of Shallow Classifiers
Metadata Labels Used for Probing Tasks
CONCLUSION