Abstract

Environmental Sound Classification (ESC) plays a vital role in machine auditory scene perception. Deep learning based ESC methods, such as the Dilated Convolutional Neural Network (D-CNN), have achieved state-of-the-art results on public datasets. However, the D-CNN ESC model is often larger than 100MB and is only suitable for systems with powerful GPUs, which prevents its application in handheld devices. In this study, we take the D-CNN ESC framework and focus on reducing the model size while maintaining the ESC performance. As a result, a lightweight D-CNN (termed LD-CNN) ESC system is developed. Our contribution is twofold. First, we propose to reduce the number of parameters in the convolution layers by factorizing a two-dimensional convolution filter ($L \times W$) into two separable one-dimensional convolution filters ($L \times 1$ and $1 \times W$). Second, we propose to replace the first fully connected layer (FCL) with a Feature Sum layer (FSL) to further reduce the number of parameters. This is motivated by our finding that the features of environmental sounds have a weak absolute locality property, so a global sum operation can be applied to compress the feature map. Experiments on three public datasets (ESC50, UrbanSound8K, and CICESE) show that the proposed system offers comparable classification performance with a much smaller model size. For example, the model size of our proposed system is about 2.05MB, which is 50 times smaller than the original D-CNN model, at a loss of only 1%-2% in classification accuracy.
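
To make the two parameter-saving ideas concrete, the following is a minimal sketch rather than the authors' implementation. It counts the weights of a full $L \times W$ filter bank against an $L \times 1$ layer followed by a $1 \times W$ layer, and shows a global sum that collapses each feature map to one value. The function names, the channel counts, the $5 \times 5$ kernel size, and the choice to keep the same number of channels between the two one-dimensional layers are all assumptions made for illustration; the abstract does not specify them.

```python
def conv2d_params(c_in, c_out, L, W):
    """Weights in a full L x W convolution layer (biases ignored)."""
    return c_in * c_out * L * W


def factorized_params(c_in, c_out, L, W):
    """Weights in an L x 1 layer followed by a 1 x W layer.

    Assumption: the intermediate output keeps c_out channels.
    """
    return c_in * c_out * L + c_out * c_out * W


def feature_sum(feature_maps):
    """Collapse each channel's 2-D map to a single number by summing it.

    Assumption: the sum runs over every position in the channel, so the
    output length equals the number of channels, whatever the map size.
    """
    return [sum(sum(row) for row in channel) for channel in feature_maps]


# Hypothetical layer: 32 input channels, 32 output channels, 5 x 5 kernel.
full = conv2d_params(32, 32, 5, 5)
separable = factorized_params(32, 32, 5, 5)
print(full, separable)

# Two hypothetical 2 x 2 feature maps reduced to two values.
print(feature_sum([[[1, 2], [3, 4]], [[0, 0], [1, 1]]]))
```

Under these assumptions the factorized pair needs $L + W$ weights per channel pair instead of $L \cdot W$. Because the summed vector has one entry per channel, a classifier layer placed after it no longer scales with the spatial size of the feature map, which is where a fully connected layer would otherwise carry most of its weights.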
