Singing Voice Detection: A Survey.

Ramy Monir, Daniel Kostrzewa, Dariusz Mrozek

doi:10.3390/e24010114


Open Access

https://doi.org/10.3390/e24010114


Abstract

Singing voice detection, or vocal detection, is a classification task that determines whether a singing voice is present in a given audio segment. It is a crucial preprocessing step that can improve the performance of other tasks such as automatic lyrics alignment, singing melody transcription, singing voice separation, and vocal melody extraction. This paper surveys singing voice detection techniques, with a particular focus on state-of-the-art algorithms such as convolutional LSTM and GRU-RNN. It compares existing methods for singing voice detection, mainly on the Jamendo and RWC datasets. Long-term recurrent convolutional networks have achieved impressive results on public datasets. The main goal of the present paper is to investigate both classical and state-of-the-art approaches to singing voice detection.

Highlights

In this paper, we aim to fill this gap by investigating the classical approaches to singing voice detection (SVD) systems [13], which focus on the acoustic similarity between singing voice and speech, using cepstral coefficients [13] and linear predictive coding [14].
The authors found that discrete Fourier transform (DFT) coefficients achieved higher detection accuracy than Mel-frequency cepstral coefficients (MFCCs) and raw pulse-code modulation (PCM), evaluated on all epochs and averaged over 10 trials.
The results show that the long short-term memory recurrent neural network (LSTM-RNN) outperforms all other methods in statistical benchmarks.
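
The highlights above contrast raw PCM, DFT coefficients, and MFCCs as classifier inputs. As a rough illustration only, the sketch below shows how the first two representations could be derived from the same signal with NumPy. The frame length, hop size, Hann window, sample rate, and synthetic test tone are all assumptions made for this example and are not taken from the surveyed papers. MFCCs would add a mel filterbank, a logarithm, and a DCT on top of the DFT magnitudes, and are omitted here.

```python
import numpy as np

def frame_signal(pcm, frame_len=1024, hop=512):
    """Split a 1-D PCM signal into overlapping frames, one frame per row.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    n_frames = 1 + (len(pcm) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return pcm[idx]

def dft_features(frames):
    """Magnitude of the one-sided DFT of each Hann-windowed frame."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

# A synthetic one-second 440 Hz tone at 16 kHz stands in for real audio.
sr = 16000
t = np.arange(sr) / sr
pcm = np.sin(2 * np.pi * 440.0 * t)

raw_frames = frame_signal(pcm)      # raw PCM input, shape (n_frames, frame_len)
spectra = dft_features(raw_frames)  # DFT input, shape (n_frames, frame_len // 2 + 1)
```

Either array could then be fed, frame by frame, to a vocal/non-vocal classifier. Which representation works best is an empirical question, which is what the reported comparison addresses.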