Abstract

In recent years, most environmental sound recognition models have used acoustic features such as the log-mel spectrogram (Logmel) or mel-frequency cepstral coefficients (MFCCs) as input to the network, but recognition results remain unsatisfactory. These acoustic features were originally designed for speech and music recognition and may not represent environmental sound comprehensively. In this paper, we design a dual-input convolutional neural network model that takes both Logmel features and raw waveforms as inputs, extracts features from each with a separate convolutional network, and then combines the features for recognition. Because the network extracts features directly from the raw waveform, it can capture information that the handcrafted acoustic features may miss; this information is complementary to the Logmel features and improves the representation of environmental sound. Experiments on the Google AudioSet dataset show that the proposed model achieves a recognition accuracy of 95.1%, outperforming models that use a single feature or a combination of multiple acoustic features as input.
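The dual-input architecture described above can be sketched as follows. This is a minimal illustrative PyTorch model, not the paper's actual configuration: the layer counts, kernel sizes, channel widths, and input shapes are all assumptions chosen only to show the two-branch structure (a 2-D convolutional branch over the Logmel spectrogram, a 1-D convolutional branch over the raw waveform) and the feature concatenation before classification.

```python
import torch
import torch.nn as nn

class DualInputCNN(nn.Module):
    """Illustrative sketch of a dual-input CNN: one branch processes a
    log-mel spectrogram, another processes the raw waveform, and the
    extracted features are concatenated before classification.
    All layer sizes are hypothetical, not taken from the paper."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        # 2-D conv branch for the Logmel input, shaped (batch, 1, mel, time)
        self.spec_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch, 32)
        )
        # 1-D conv branch operating directly on the raw waveform
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),   # -> (batch, 32)
        )
        # Feature combination by concatenation, then a linear classifier
        self.classifier = nn.Linear(32 + 32, n_classes)

    def forward(self, logmel: torch.Tensor, waveform: torch.Tensor) -> torch.Tensor:
        feats = torch.cat(
            [self.spec_branch(logmel), self.wave_branch(waveform)], dim=1
        )
        return self.classifier(feats)

model = DualInputCNN(n_classes=10)
logmel = torch.randn(4, 1, 64, 100)   # (batch, 1, mel bins, frames)
waveform = torch.randn(4, 1, 16000)   # (batch, 1, samples), ~1 s at 16 kHz
logits = model(logmel, waveform)
print(logits.shape)                   # torch.Size([4, 10])
```

Each branch reduces its input to a fixed-length feature vector via adaptive pooling, so the two modalities can be concatenated regardless of spectrogram size or waveform length.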
