Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing

Xulong Zhang, Yi Yu, Xi Chen, Wei Li, Yongwei Gao

doi:10.3390/electronics9091458

Open Access

https://doi.org/10.3390/electronics9091458

Journal: Electronics	Publication Date: Sep 7, 2020
Citations: 26	License type: CC BY 4.0

Affiliation: National Institute of Informatics, Fudan University

Abstract

Singing voice detection, or vocal detection, is a classification task that determines whether a given audio segment contains a singing voice. The task plays an important role in vocal-related music information retrieval tasks such as singer identification. Although humans can easily distinguish singing from nonsinging parts, it remains difficult for machines to do so. Most existing methods combine hand-engineered audio features with classifiers and therefore rely on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer hearing. To extract essential features that reflect the audio content and to characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) for vocal detection. The convolutional layer in the LRCN performs feature extraction, and the long short-term memory (LSTM) layer learns the temporal relationships. Singing voice and accompaniment separation as preprocessing and time-domain smoothing as postprocessing were combined with the LRCN to form a complete system. Experiments on five public datasets investigated the impact of feature fusion, frame size, and block size on LRCN temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reaches state-of-the-art performance on public datasets.
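
The time-domain smoothing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which smoothing method the authors used, so the sliding median filter below is an assumption chosen for simplicity, and the function name `median_smooth`, the kernel size, and the 0.5 threshold are all hypothetical. The idea is only to show how isolated frame-level errors can be removed from a sequence of per-frame vocal probabilities.

```python
import numpy as np


def median_smooth(frame_probs, kernel=5, threshold=0.5):
    """Smooth per-frame vocal probabilities with a sliding median, then threshold.

    NOTE: a median filter is an assumed stand-in for the paper's
    time-domain smoothing; kernel and threshold are illustrative values.
    """
    if kernel % 2 == 0:
        raise ValueError("kernel must be odd")
    probs = np.asarray(frame_probs, dtype=float)
    half = kernel // 2
    # Repeat the edge values so the output has the same length as the input.
    padded = np.pad(probs, half, mode="edge")
    # Median over the length-`kernel` neighbourhood centred on each frame.
    smoothed = np.array(
        [np.median(padded[i:i + kernel]) for i in range(len(probs))]
    )
    # 1 = singing, 0 = nonsinging.
    return (smoothed >= threshold).astype(int)


# Hypothetical classifier outputs with two isolated errors (frames 2 and 7).
raw = [0.9, 0.8, 0.1, 0.9, 0.9, 0.2, 0.1, 0.8, 0.1, 0.1]
labels = median_smooth(raw, kernel=3)
```

With a kernel of 3, the single low-probability frame inside the vocal region and the single high-probability frame inside the nonvocal region are both overruled by their neighbours, leaving one contiguous vocal segment followed by one nonvocal segment.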

Highlights

It is not difficult for humans to identify the singing segments in a piece of music, and such identification is seldom affected by voice types, pronunciation changes, background music, or even language forms [1]
This paper proposes a new data-driven singing voice detection method based on a long-term recurrent convolutional network (LRCN) [23]
Related work on singing voice detection mainly uses five public datasets

Summary

Introduction

It is not difficult for humans to identify the singing segments in a piece of music, and such identification is seldom affected by voice types, pronunciation changes, background music, or even language forms [1]. Conventional methods combine speech features with statistical classifiers to detect and recognize singing voice segments in songs [6]. Such methods use features such as Mel-frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC), and classifiers such as Gaussian mixture models (GMM), artificial neural networks, support vector machines (SVM), and hidden Markov models (HMM). However, the features and statistical classification methods used in speech recognition have certain limitations for singing voice detection. Deep learning, with its powerful feature representation and its temporal and spatial modeling capabilities, has begun to be applied to singing voice detection [10,11].
