Abstract
Filter banks on the short-time Fourier transform (STFT) spectrogram have long been studied as a way to analyze and process audio. The frame shift in the STFT procedure determines the temporal resolution. However, many discriminative audio applications need long-term time and frequency correlations. This work uses Toeplitz-matrix-motivated filter banks to extract long-term time and frequency information. The paper investigates the mechanism of long-term filter banks and the corresponding spectrogram reconstruction method. The time duration and shape of the filter banks are designed and learned using neural networks. We test our approach on different tasks. Compared with traditional frequency filter banks, the spectrogram reconstruction error in the audio source separation task is reduced by a relative 6.7%, and the classification error in the audio scene classification task is reduced by a relative 6.5%. The experiments also show that the time duration of long-term filter banks in the classification task is much larger than in the reconstruction task.
Highlights
Audio in a realistic environment is typically composed of different sound sources
The time duration of long-term filter banks is limited by σk, and the strength of each frequency bin is reconstructed by αk; the total number of parameters is reduced from 2mT in Eq. 2 to 2m in Eq. 3
A novel framework of filter banks that can extract long-term time and frequency correlation is proposed in this paper
Summary
Audio in a realistic environment is typically composed of different sound sources. Yet humans have no problem organizing the elements into their sources to recognize the acoustic environment. Neural networks organized into a two-dimensional space have been proposed by Wang and Chang [22] to model the time and frequency organization of audio elements. They used two-dimensional Gaussian lateral connectivity and global inhibition to parameterize the network, where the two dimensions correspond to frequency and time respectively. The time duration is different, but for each frame, the filter shape is constant. This mechanism can be implemented using a Toeplitz-matrix-motivated network.
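The idea of a filter whose shape is the same at every frame can be illustrated with a small NumPy sketch. It is not the paper's implementation: Eq. 2 and Eq. 3 are not reproduced in this summary, so the Gaussian window form, the per-bin parameters `sigma` and `alpha` (standing in for σk and αk), and the function names `toeplitz_gaussian` and `long_term_filter` are all assumptions made for illustration. Under that assumption, each frequency bin needs only two numbers, which matches the 2m parameter count quoted above if m is taken to be the number of frequency bins.

```python
import numpy as np


def toeplitz_gaussian(num_frames, sigma_k, alpha_k):
    """Build a symmetric T x T Toeplitz matrix from a Gaussian window.

    ASSUMPTION: the window alpha_k * exp(-lag^2 / (2 * sigma_k^2)) is a
    hypothetical stand-in for the paper's Eq. 3, which is not shown here.
    sigma_k sets the time extent of the filter, alpha_k its strength.
    """
    frames = np.arange(num_frames)
    # Entry (i, j) depends only on |i - j|, which is what makes it Toeplitz
    # and gives every frame the same filter shape.
    lags = np.abs(frames[:, None] - frames[None, :])
    return alpha_k * np.exp(-(lags ** 2) / (2.0 * sigma_k ** 2))


def long_term_filter(spec, sigma, alpha):
    """Filter each frequency bin of an (m, T) spectrogram along time.

    sigma and alpha are length-m arrays: two parameters per frequency bin,
    so 2m parameters in total (hypothetical parameterization).
    """
    num_bins, num_frames = spec.shape
    out = np.empty((num_bins, num_frames), dtype=float)
    for k in range(num_bins):
        weights = toeplitz_gaussian(num_frames, sigma[k], alpha[k])
        out[k] = weights @ spec[k]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.random((4, 50))                  # toy spectrogram: m=4 bins, T=50 frames
    sigma = np.array([1.0, 2.0, 4.0, 8.0])      # wider sigma -> longer time context
    alpha = np.ones(4)
    filtered = long_term_filter(spec, sigma, alpha)
    print(filtered.shape)
```

In a learned setting the `sigma` and `alpha` arrays would be trainable network parameters rather than fixed values; the dense T x T matrix is used here only to make the Toeplitz structure explicit, and a convolution along time would compute the same result more cheaply.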
More From: EURASIP Journal on Audio, Speech, and Music Processing