Abstract

Herein, we propose a multi-scale multi-band dilated time-frequency densely connected convolutional network (DenseNet) with long short-term memory (LSTM) for audio source separation. Because the spectrogram of an acoustic signal can be regarded both as an image and as time-series data, it is well suited to a convolutional recurrent neural network (CRNN) architecture. We improved audio source separation performance by applying a dilated block, built on dilated convolutions, to the CRNN architecture. The dilated block effectively enlarges the receptive field over the spectrogram. In addition, it was designed to reflect the acoustic characteristic that the frequency and time axes of the spectrogram vary under independent influences, such as pitch and speech rate. In speech enhancement experiments, we estimated the speech signal from a mixture of music, noise, and speech using various deep learning architectures, and conducted a subjective evaluation of the estimated speech. Speech quality, intelligibility, separation, and speech recognition performance were also measured. In music signal separation, we estimated the music signal from a mixture of music and speech using several deep learning architectures, and then measured separation performance and music identification accuracy on the estimated signal. Overall, the proposed architecture showed the best performance among the compared deep learning architectures in both the speech and the music experiments.
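
As a rough illustration of the dilated block described above, here is a minimal, hypothetical PyTorch sketch of a densely connected block whose 3x3 convolutions use exponentially growing dilation to enlarge the receptive field over the spectrogram. The layer count, growth rate, and the use of a shared dilation on both axes are assumptions made for brevity, not the paper's exact design (which accounts for the frequency and time axes being independently influenced).

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Densely connected 3x3 convolutions with exponentially growing dilation."""
    def __init__(self, in_channels: int, growth: int = 16, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for i in range(num_layers):
            d = 2 ** i  # dilation 1, 2, 4, 8: receptive field grows quickly
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                # padding = dilation keeps the time-frequency size unchanged
                nn.Conv2d(channels, growth, kernel_size=3, padding=d, dilation=d),
            ))
            channels += growth  # dense connectivity: each layer sees all earlier features

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

# Input shaped (batch, channel, frequency_bins, time_frames), e.g. a magnitude spectrogram.
x = torch.randn(2, 1, 256, 64)
out = DilatedDenseBlock(in_channels=1)(x)
print(out.shape)  # torch.Size([2, 65, 256, 64]) -> 1 + 4 * 16 channels
```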

Highlights

  • In a real environment, humans hear several mixed signals simultaneously

  • Band splitting at 2 kHz and 4 kHz was applied to DilDenseNet, MMDenseLSTM, and the proposed architecture (a sketch of this splitting follows these highlights)

  • We propose MMDilDenseLSTM for speech recognition or music identification after audio source separation
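
As a rough illustration of the band splitting mentioned in the second highlight, here is a hypothetical NumPy sketch that slices a magnitude spectrogram at 2 kHz and 4 kHz so that each sub-band can be processed by its own network branch. The 16 kHz sampling rate and 1024-point FFT are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def split_bands(spec, sample_rate=16000, n_fft=1024):
    """Split a (freq_bins, time_frames) spectrogram at 2 kHz and 4 kHz."""
    hz_per_bin = sample_rate / n_fft          # frequency resolution per STFT bin
    b1 = int(round(2000 / hz_per_bin))        # bin index at 2 kHz
    b2 = int(round(4000 / hz_per_bin))        # bin index at 4 kHz
    return spec[:b1], spec[b1:b2], spec[b2:]  # low, mid, high sub-bands

# Dummy magnitude spectrogram: 513 bins (n_fft/2 + 1) by 100 frames.
spec = np.abs(np.random.randn(513, 100))
low, mid, high = split_bands(spec)
print(low.shape, mid.shape, high.shape)  # (128, 100) (128, 100) (257, 100)
```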

Introduction

In a real environment, we hear several mixed signals simultaneously. In these situations, we can selectively attend to a signal we want, effectively segregating a target from the perceived mixture. This is the so-called auditory scene analysis or cocktail party problem [1], which is the main topic of this paper. Audio source separation is used to estimate a target signal, such as speech or music, when the target signal is mixed with other signals. If the target signal to be estimated is speech, the task is speech enhancement; if the target signal is music, it is music signal separation.
