Abstract

One-dimensional convolutional neural networks (1D CNNs) that take raw waveforms directly as input have been less competitive than 2D CNNs for environmental sound recognition. To overcome this disadvantage, we propose a novel lightweight 1D CNN structure that employs an attention mechanism and yields significant improvements in both accuracy and computational complexity. Concretely, (1) two attention modules are constructed along the channel and time dimensions separately and combined to give an intermediate feature map that focuses on key frequency bands and semantically related time-frame information. (2) Without increasing training overhead, snapshot ensembling is employed to further improve performance. Results on two benchmark datasets (UrbanSound8k, ESC-10) demonstrate that, by employing the attention mechanism, our model outperforms all previously reported 1D CNN approaches in accuracy with fewer parameters. Moreover, with this performance gain, the proposed model is superior to most existing spectral-based 2D CNN approaches and competitive with state-of-the-art (SOTA) performance, while using orders of magnitude fewer parameters. Overall, the results indicate that our model is compact and has good potential for practical resource-limited applications, such as sound recognition on embedded platforms.
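
As a rough illustration of the two ideas named in the abstract, the following minimal sketch shows (a) attention weights computed along the channel and time axes of a small feature map and combined multiplicatively, and (b) the cyclic cosine learning-rate schedule and prediction averaging commonly used for snapshot ensembling. It is not the authors' architecture. The pooling-plus-sigmoid weighting, the multiplicative combination, and all function names (`channel_attention`, `temporal_attention`, `apply_attention`, `snapshot_lr`, `ensemble_average`) are assumptions made for illustration, and a real model would use learned layers rather than plain averages.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap):
    """One weight per channel, from the mean activation over time (assumed form)."""
    return [sigmoid(sum(row) / len(row)) for row in fmap]


def temporal_attention(fmap):
    """One weight per time frame, from the mean activation over channels (assumed form)."""
    n_channels = len(fmap)
    n_frames = len(fmap[0])
    return [
        sigmoid(sum(fmap[c][t] for c in range(n_channels)) / n_channels)
        for t in range(n_frames)
    ]


def apply_attention(fmap):
    """Rescale a (channels x frames) feature map by both sets of weights."""
    ch = channel_attention(fmap)
    tm = temporal_attention(fmap)
    return [
        [fmap[c][t] * ch[c] * tm[t] for t in range(len(tm))]
        for c in range(len(ch))
    ]


def snapshot_lr(step, total_steps, n_cycles, lr_max):
    """Cyclic cosine annealing: the rate restarts at lr_max in every cycle."""
    cycle_len = math.ceil(total_steps / n_cycles)
    pos = step % cycle_len  # position inside the current cycle (0-indexed)
    return 0.5 * lr_max * (math.cos(math.pi * pos / cycle_len) + 1.0)


def ensemble_average(prob_lists):
    """Average the class probabilities predicted by the saved snapshots."""
    n = len(prob_lists)
    return [sum(p[k] for p in prob_lists) / n for k in range(len(prob_lists[0]))]
```

In this sketch a snapshot would be saved at the end of each learning-rate cycle, so one training run of `total_steps` steps yields `n_cycles` models whose predictions are averaged at test time. That is why snapshot ensembling adds no training overhead compared with training a single model for the same number of steps.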
