Abstract

Acoustic scene classification has attracted great interest in recent years due to its diverse applications. Various acoustic and visual features have been proposed and evaluated. However, few studies have investigated the aggregation of acoustic and visual features for acoustic scene classification. In this paper, we investigate various feature sets based on the fusion of acoustic and visual features. Specifically, the acoustic features are extracted directly from the waveform: spectral centroid, spectral entropy, spectral flux, spectral roll-off, short-time energy, zero-crossing rate, and Mel-frequency cepstral coefficients. For visual features, we calculate local binary patterns, histograms of gradients, and moments from the time-frequency representation of the audio scene. Then, three feature selection algorithms are applied to the feature sets to reduce feature dimensionality: correlation-based feature selection, principal component analysis, and ReliefF. Experimental results show that the proposed system achieves an accuracy improvement of 15.43% over the baseline system on the development set. When all development sets are used for training, the performance on the evaluation set provided by the TUT Acoustic Scene 2016 challenge is 87.44%, which is the fourth best among all non-neural-network systems.
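
To make the waveform-level acoustic features concrete, the following is a minimal sketch of how three of them (short-time energy, zero-crossing rate, and spectral centroid) could be computed per frame with NumPy. It is an illustration only, not the implementation used in the paper: the frame length, hop size, Hann window, sampling rate, synthetic test tone, and all function names are assumptions introduced here.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (assumed framing scheme)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def spectral_centroid(frames, sr):
    """Magnitude-weighted mean frequency (Hz) of each frame."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag @ freqs) / (mag.sum(axis=1) + 1e-12)


# Hypothetical input: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

frames = frame_signal(tone, frame_len=1024, hop=512)
features = np.column_stack([
    short_time_energy(frames),
    zero_crossing_rate(frames),
    spectral_centroid(frames, sr),
])
```

Each row of `features` describes one frame. In a pipeline like the one described above, such per-frame values would typically be summarized over a recording and concatenated with the visual descriptors before feature selection, although the exact aggregation used by the authors is not stated in this abstract.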
