Abstract

Convolutional neural networks with time-frequency feature representations have attracted increasing attention for acoustic scene classification. However, most existing methods are restricted to plain discriminative models with low-level feature representations and struggle to leverage a priori knowledge of the sound events in a particular scene, which leaves room for improving performance. In this paper, we propose the Adaptive time-frequency Feature Resolution Network (AdaFRN), which embraces recent progress in attention mechanisms, feature fusion, and very deep architectures. Specifically, a multi-resolution attention distributor automatically selects a proper feature resolution to capture salient events, and a bidirectional feature pyramid network transmits information across multiple scales. Extensive experiments demonstrate that the proposed model yields better results than state-of-the-art methods; with the normalized mixup strategy, our model outperforms the previous state of the art by 1.6%.
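To make the multi-resolution attention idea concrete, the following is a minimal sketch of how such a distributor could be realized: feature maps are pooled to several time-frequency resolutions, and learned attention weights decide how much each resolution contributes. This is an illustrative assumption, not the paper's actual implementation; the class name `MultiResolutionAttentionDistributor`, the pooling sizes, and the attention head are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionAttentionDistributor(nn.Module):
    """Hypothetical sketch: attention-weighted fusion of feature maps
    pooled at several time-frequency resolutions."""

    def __init__(self, channels: int, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        # One attention logit per resolution, predicted from a
        # global summary of the input feature map.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, len(pool_sizes)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        weights = F.softmax(self.attn(x), dim=1)  # (batch, n_resolutions)
        out = torch.zeros_like(x)
        for i, p in enumerate(self.pool_sizes):
            # Pool to a coarser resolution, then upsample back so the
            # branches can be summed.
            pooled = F.avg_pool2d(x, kernel_size=p) if p > 1 else x
            restored = F.interpolate(pooled, size=x.shape[2:], mode="nearest")
            out = out + weights[:, i].view(-1, 1, 1, 1) * restored
        return out

# Usage on a batch of log-mel spectrogram feature maps (shapes assumed).
feats = torch.randn(8, 64, 128, 431)   # (batch, channels, mel bins, frames)
block = MultiResolutionAttentionDistributor(channels=64)
print(block(feats).shape)              # torch.Size([8, 64, 128, 431])
```

The design choice sketched here keeps every branch at the input's spatial size so the fused output can feed any downstream backbone unchanged; the softmax over resolutions is what lets the network "select a proper feature resolution" per input, in the spirit the abstract describes.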
