Abstract
In this letter, we propose a concise feature representation framework for acoustic scene classification that prunes embeddings obtained from SoundNet, a deep convolutional neural network. We demonstrate that the feature maps generated at various layers of SoundNet exhibit redundancy. The proposed singular value decomposition (SVD) based method reduces this redundancy under the assumption that useful feature maps produce embeddings that lie along different directions for different classes; feature maps that produce similar embeddings for different classes are therefore ignored. When an ensemble of classifiers operates on the various layers of SoundNet, pruning the redundant feature maps reduces both dimensionality and computational complexity. Our experiments on acoustic scene classification show that ignoring 73% of the feature maps degrades performance by less than 1% while reducing computational complexity by 12.67%. In addition, we show that the proposed pruning framework can be used to remove filters from the SoundNet network architecture itself, yielding a 13x smaller model storage requirement and reducing the number of parameters from 28 million to 2 million with only marginal degradation in performance. The small model obtained after applying the proposed pruning procedure is evaluated on different acoustic scene classification datasets and shows excellent generalization ability.
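To illustrate the pruning criterion stated above, the following minimal NumPy sketch scores each feature map by how much its per-class mean embeddings spread across singular directions: maps whose class-wise embeddings are nearly collinear (i.e., similar for all classes) receive low scores and are candidates for removal. This is a hypothetical illustration of the stated assumption, not the exact procedure from the letter; the array layout, function names, and the 27% keep ratio (complement of the reported 73% pruning) are assumptions for demonstration only.

```python
import numpy as np

def featuremap_discriminability(activations, labels, num_classes):
    """Score feature maps by the directional spread of their per-class embeddings.

    activations: (num_clips, num_feature_maps, feat_dim) array, e.g. time-pooled
        SoundNet feature maps per audio clip (hypothetical layout).
    labels: (num_clips,) array of class indices.
    Returns one score per feature map: near 0 when all classes produce nearly
    the same embedding direction (redundant), higher when the class-wise
    embeddings spread across several singular directions (discriminative).
    """
    num_maps = activations.shape[1]
    scores = np.zeros(num_maps)
    for k in range(num_maps):
        # Per-class mean embedding produced by feature map k.
        class_means = np.stack([
            activations[labels == c, k, :].mean(axis=0)
            for c in range(num_classes)
        ])  # shape: (num_classes, feat_dim)
        s = np.linalg.svd(class_means, compute_uv=False)
        # Fraction of energy outside the leading singular direction.
        scores[k] = 1.0 - s[0] ** 2 / (np.sum(s ** 2) + 1e-12)
    return scores

def select_feature_maps(scores, keep_ratio=0.27):
    """Keep the highest-scoring fraction of feature maps (assumed ~27%,
    i.e. pruning ~73% as in the reported experiments)."""
    num_keep = max(1, int(round(keep_ratio * len(scores))))
    return np.argsort(scores)[::-1][:num_keep]
```

Under this sketch, the retained feature-map indices would then define both the reduced embedding used by the layer-wise classifier ensemble and the filters kept when shrinking the network itself.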