Abstract

This paper proposes a multichannel environmental sound segmentation method. Environmental sound segmentation is an integrated approach that achieves sound source localization, sound source separation, and classification simultaneously. When multiple microphones are available, spatial features can be used to improve the localization and separation accuracy of sounds arriving from different directions; however, conventional methods have three drawbacks: (a) when sound source localization and separation using spatial features and classification using spectral features are trained in the same neural network, the network may overfit to the relationship between the direction of arrival and the class of a sound, reducing its reliability on novel events; (b) although the permutation invariant training used in automatic speech recognition could be extended, it is impractical for environmental sounds, which may include an unlimited number of sound sources; and (c) various features, such as the complex values of the short-time Fourier transform and interchannel phase differences, have been used as spatial features, but no study has compared them. The proposed method comprises two discrete blocks: a sound source localization and separation (SSLS) block and a sound source separation and classification (SSSC) block. Separating the blocks avoids overfitting to the relationship between the direction of arrival and the class. Simulation experiments on datasets containing 75 classes of environmental sounds showed that the root mean squared error of the proposed method was lower than that of conventional methods.
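
As a concrete illustration of the spatial features discussed above, the following is a minimal sketch of sine/cosine interchannel phase difference (IPD) features computed from a multichannel short-time Fourier transform. The function name `ipd_sin_cos`, the reference-channel convention, and the array shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of sin/cos IPD spatial features; shapes and the
# reference-channel convention are illustrative assumptions.
import numpy as np

def ipd_sin_cos(stft_multi: np.ndarray, ref_ch: int = 0) -> np.ndarray:
    """stft_multi: complex STFT, shape (channels, frames, freq_bins).
    Returns spatial features of shape (2 * (channels - 1), frames, freq_bins)."""
    phase = np.angle(stft_multi)                     # per-channel phase
    others = [c for c in range(stft_multi.shape[0]) if c != ref_ch]
    ipd = phase[others] - phase[ref_ch]              # IPD w.r.t. reference mic
    # sin/cos encoding avoids the 2*pi wrap-around discontinuity of raw IPD
    return np.concatenate([np.sin(ipd), np.cos(ipd)], axis=0)

# Example: 8-channel STFT with 100 frames and 257 frequency bins
x = np.exp(1j * np.random.uniform(-np.pi, np.pi, (8, 100, 257)))
feats = ipd_sin_cos(x)  # shape (14, 100, 257)
```

Encoding the IPD through its sine and cosine keeps the feature continuous across the ±π phase wrap, which is one reason this representation can outperform raw phase differences.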

Highlights

  • Various methods such as sound source localization (SSL), sound source separation (SSS), and classification have been proposed in acoustic signal processing, robot audition, and machine learning for use in real-world environments containing multiple overlapping sound events [1,2,3]. Conventional approaches use the cascade method, incorporating individual functions based on array signal processing techniques [4,5,6].

  • The sound source localization and separation (SSLS) block does not completely separate sounds arriving from close directions, and the errors caused by the SSLS block accumulate.

  • The proposed structure, in which the classification block of the SSLS + Classification structure was replaced by the separation and classification (SSSC) block, clearly had a smaller RMSE. The SSLS + Classification structure had no ability to correct errors that occurred in the SSLS block, whereas the SSSC block, by including the separation feature, reduced the propagation of those errors (see the sketch after this list).
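
To make the two-block structure in these highlights concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the layer choices, feature-map sizes, the number of directions (36), the 75 classes taken from the abstract, and the concatenation of the mixture spectrogram into the SSSC input are all assumptions.

```python
# Hypothetical sketch of the two-block SSLS -> SSSC structure; all layer
# sizes and the mixture-spectrogram concatenation are assumptions.
import torch
import torch.nn as nn

N_CH, N_DIR, N_CLS, F = 8, 36, 75, 257  # feature maps, directions, classes, freq bins

class SSLSBlock(nn.Module):
    """Sound source localization and separation: maps spatial + spectral
    features to one separated magnitude spectrogram per direction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(N_CH, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, N_DIR, 3, padding=1),
        )
    def forward(self, feats):                   # feats: (B, N_CH, T, F)
        return self.net(feats)                  # (B, N_DIR, T, F)

class SSSCBlock(nn.Module):
    """Sound source separation and classification: refines the (possibly
    imperfect) SSLS output and produces class-wise spectrograms."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(N_DIR + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, N_CLS, 3, padding=1),
        )
    def forward(self, separated, mixture_mag):  # mixture_mag: (B, 1, T, F)
        x = torch.cat([separated, mixture_mag], dim=1)
        return self.net(x)                      # (B, N_CLS, T, F)

ssls, sssc = SSLSBlock(), SSSCBlock()
feats = torch.randn(2, N_CH, 100, F)            # spatial + spectral input features
mix = torch.randn(2, 1, 100, F)                 # mixture magnitude spectrogram
class_specs = sssc(ssls(feats), mix)
```

Because the SSLS and SSSC blocks are separate modules trained as separate stages, class predictions cannot latch onto direction-of-arrival cues, which is how the paper avoids the overfitting described in the abstract.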

Summary

Introduction

Various methods such as sound source localization (SSL), sound source separation (SSS), and classification have been proposed in acoustic signal processing, robot audition, and machine learning for use in real-world environments containing multiple overlapping sound events [1,2,3]. In addition to magnitude spectra, using the interchannel phase difference (IPD) between microphones as a spatial feature has been reported to improve automatic speech recognition (ASR) performance on overlapping sounds containing multiple speakers. Deep learning-based methods for sound event localization and detection (SELD) have been proposed [18,19,20,21]; these methods simultaneously perform SSL and sound event detection (SED) of environmental sounds. A comparison of various spatial features revealed the sine and cosine of IPDs to be optimal for sound source localization and separation.

Multichannel automatic speech recognition
Multichannel environmental sound segmentation
Sound event localization and detection methods for environmental sound
Issues of related works
Proposed method
Feature extraction
Sound source localization and separation
Sound source separation and classification
Evaluation
Analysis of the overfitting to the relationship between the DOA and the class
Comparison between various model structures
Results and discussion
Conclusions