Abstract
Research in sound classification and recognition is rapidly advancing in the field of pattern recognition. One important area in this field is environmental sound recognition, whether it concerns the identification of endangered species in different habitats or the type of interfering noise in urban environments. Since environmental audio datasets are often limited in size, a robust model able to perform well across different datasets is of strong research interest. In this paper, ensembles of classifiers are combined that exploit six data augmentation techniques and four signal representations for retraining five pre-trained convolutional neural networks (CNNs); these ensembles are tested on three freely available environmental audio benchmark datasets: (i) bird calls, (ii) cat sounds, and (iii) the Environmental Sound Classification (ESC-50) database for identifying environmental noise sources. To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification. The best-performing ensembles are compared and shown to either outperform or perform comparably to the best methods reported in the literature on these datasets, including the challenging ESC-50 dataset. We obtained 97% accuracy on the bird dataset, 90.51% on the cat dataset, and 88.65% on ESC-50 using different approaches. In addition, the same ensemble model trained on the three datasets matched these results on the bird and cat datasets while losing only 0.1% on ESC-50. Thus, we have managed to create an off-the-shelf ensemble that can be trained on different datasets and reach performance competitive with the state of the art.
Highlights
Sound classification and recognition have long been included in the field of pattern recognition
Visual representations of audio, such as spectrograms [6] and Mel-frequency cepstral coefficient (Mel) spectrograms [7], contain valuable information, so powerful texture extraction techniques like local binary patterns (LBP) [8] and its many variants [9] began to be explored for audio classification [2,10]
ESC-50 [4]: an environmental sound classification dataset with 2000 samples evenly divided into 50 classes and five folds; each fold contains eight samples per class (400 samples in total)
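The even split described above can be sketched in a few lines. This is a minimal illustration of the assumed fold layout (2000 clips, 50 classes, 5 folds, round-robin assignment), not the actual ESC-50 metadata, which ships with its own predefined fold labels.

```python
def esc50_fold_layout(n_classes=50, n_folds=5, clips_per_class=40):
    """Assign each clip a (class, fold) pair following an even split:
    each class contributes the same number of clips to every fold."""
    layout = []
    for cls in range(n_classes):
        for i in range(clips_per_class):
            # round-robin assignment spreads each class evenly over the folds
            layout.append((cls, i % n_folds))
    return layout

layout = esc50_fold_layout()
assert len(layout) == 2000                       # 50 classes x 40 clips
fold0_classes = [c for c, f in layout if f == 0]
assert len(fold0_classes) == 400                 # 400 samples per fold
assert all(fold0_classes.count(cls) == 8 for cls in range(50))  # 8 per class
```

With such a layout, cross-validation simply holds out one fold at a time for testing and trains on the remaining four, which is the standard evaluation protocol for ESC-50.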
Summary
Sound classification and recognition have long been included in the field of pattern recognition. Following the three classical pattern recognition steps of (i) preprocessing, (ii) feature/descriptor extraction, and (iii) classification, most early work in sound classification began by extracting features from audio recordings such as the Statistical Spectrum Descriptor or Rhythm Histogram [5]. Once it was recognized that visual representations of audio, such as spectrograms [6] and Mel-frequency cepstral coefficient (Mel) spectrograms [7], contain valuable information, powerful texture extraction techniques like local binary patterns (LBP) [8] and its many variants [9] began to be explored for audio classification [2,10]. Later work combined CNNs with visual features; such fusions of CNNs with traditional techniques were shown to outperform both stand-alone conventional approaches and single deep learning models
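To make the spectrogram representation mentioned above concrete, here is a minimal NumPy-only sketch of a short-time Fourier transform (STFT) magnitude spectrogram, the kind of image-like input that texture descriptors and CNNs operate on. The window length and hop size are illustrative choices, not parameters taken from the paper.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=256):
    """STFT magnitudes: each column is the spectrum of one
    Hann-windowed frame of the input signal."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    # shape: (n_fft // 2 + 1 frequency bins, n_frames time steps)
    return np.stack(frames, axis=1)

# toy input: one second of a 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
assert spec.shape[0] == 257  # frequency bins for n_fft = 512
# the strongest bin should sit near 440 Hz (440 * n_fft / sr ~ 28)
peak_bin = spec[:, spec.shape[1] // 2].argmax()
assert abs(peak_bin - 28) <= 1
```

A Mel spectrogram is obtained from this representation by projecting the frequency bins onto a Mel-spaced filter bank (e.g. via `librosa.feature.melspectrogram`); the resulting 2-D array can then be treated as an image for LBP-style texture extraction or fed to a pre-trained CNN.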