Abstract

Singing voice detection remains a challenging task because the voice can be obscured by instruments that occupy the same frequency band, and even share the same timbre, when they mimic the mechanism of human singing. Because feature engineering is complex and adapts poorly, there is a recent trend towards feature learning, in which deep neural networks perform both feature extraction and classification. In this paper, we present two methods that exploit the channel properties of a convolutional neural network to improve singing voice detection by feature learning. First, channel attention learning is presented to measure the importance of each feature map, using two attention mechanisms: scaled dot-product and squeeze-and-excitation. This lets the network place more attention on the more important feature maps. Second, multi-scale representations are fed to the input channels to add information in terms of scale. Different songs are generally best represented by spectrograms at different scales, and multi-scale representations let the network choose the most suitable one for the task. In experiments on three public datasets, we demonstrated the effectiveness of the two methods, with accuracy increasing by up to 2.13 percent over its already high initial level.
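To make the two channel attention mechanisms named above more concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a single feature map tensor of shape (channels, height, width). The function names, the reduction ratio of 4, the random weights, and the use of the flattened feature maps as queries, keys and values without learned projections are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def se_channel_attention(fmap, w1, w2):
    """Squeeze-and-excitation over channels (hypothetical helper).

    fmap: (C, H, W) feature maps; w1: (C // r, C); w2: (C, C // r).
    """
    squeezed = fmap.mean(axis=(1, 2))        # squeeze: global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeezed)  # bottleneck layer with ReLU -> (C // r,)
    weights = sigmoid(w2 @ hidden)           # excitation: per-channel gate in (0, 1)
    return fmap * weights[:, None, None], weights

def dot_product_channel_attention(fmap):
    """Scaled dot-product attention across channels (hypothetical helper).

    Assumption: each flattened feature map serves as query, key and value.
    """
    n_channels = fmap.shape[0]
    flat = fmap.reshape(n_channels, -1)                # (C, H * W)
    scores = flat @ flat.T / np.sqrt(flat.shape[1])    # (C, C) channel similarities
    attn = softmax(scores, axis=-1)                    # each row sums to 1
    return (attn @ flat).reshape(fmap.shape), attn

# Illustrative usage with random data: 8 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 40, 25))
w1 = 0.1 * rng.standard_normal((2, 8))
w2 = 0.1 * rng.standard_normal((8, 2))

se_out, se_weights = se_channel_attention(fmap, w1, w2)
dp_out, dp_attn = dot_product_channel_attention(fmap)
```

In both cases the output keeps the shape of the input feature maps. The squeeze-and-excitation variant rescales each channel by a single learned gate, whereas the dot-product variant mixes channels according to their pairwise similarity.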

Highlights

  • Singing voice detection (SVD) is a task used to discriminate whether an audio segment contains at least one person’s singing voice

  • We focused on exploring channel properties to improve SVD performance, under the condition that a convolutional neural network (CNN) is used for feature learning with a low-level input representation, i.e., the log-mel spectrogram

  • We think that Mir1k is the “easiest” dataset and that its potential is almost completely exploited by the inferior method, i.e., the baseline CNN, so the better methods cannot significantly improve the results
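As a rough illustration of feeding multi-scale log-mel spectrograms to the input channels of a CNN, the sketch below computes log-mel spectrograms with several STFT window lengths and stacks them along a channel axis. It is only a sketch under stated assumptions: interpreting "scale" as the window length, the 16 kHz sample rate, the window lengths 512/1024/2048, the hop of 160 samples, the 40 mel bands, and all function names are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centres are evenly spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(signal, sr, n_fft, hop, n_mels):
    # Centre-pad so every window length yields the same number of frames.
    padded = np.pad(signal, n_fft // 2, mode="reflect")
    n_frames = 1 + len(signal) // hop
    window = np.hanning(n_fft)
    frames = np.stack([padded[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T       # (n_frames, n_mels)
    return np.log(mel + 1e-6).T                             # (n_mels, n_frames)

def multi_scale_log_mel(signal, sr, n_ffts=(512, 1024, 2048), hop=160, n_mels=40):
    # One channel per window length; a shared hop and mel resolution align the shapes.
    return np.stack([log_mel(signal, sr, n, hop, n_mels) for n in n_ffts], axis=0)

# Illustrative usage: one second of a synthetic 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2.0 * np.pi * 440.0 * t)
features = multi_scale_log_mel(signal, sr)   # shape: (scales, n_mels, n_frames)
```

Short windows give finer time resolution and long windows finer frequency resolution, so stacking them lets a network with learnable input-channel weights favour whichever resolution suits a given excerpt.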


Summary

Introduction

Singing voice detection (SVD) is a task used to discriminate whether an audio segment contains at least one person’s singing voice. Some instruments mimic the mechanism of human singing and share the frequency bands and timbre characteristics of the singing voice, which increases the difficulty of the SVD task. In the field of music information retrieval (MIR), SVD is fundamental and crucial for various tasks, such as singing voice separation [1], singer identification [2], and lyrics alignment [3].

