Abstract

In this work, we present an ensemble for automated audio classification that fuses different types of features extracted from audio files. These features are evaluated, compared, and fused with the goal of producing better classification accuracy than other state-of-the-art approaches without ad hoc parameter optimization. We present an ensemble of classifiers that performs competitively on different types of animal audio datasets using the same set of classifiers and parameter settings. To produce this general-purpose ensemble, we ran a large number of experiments that fine-tuned pretrained convolutional neural networks (CNNs) for different audio classification tasks (bird, bat, and whale audio datasets). Six different CNNs were tested, compared, and combined. Moreover, a further CNN, trained from scratch, was tested and combined with the fine-tuned CNNs. To the best of our knowledge, this is the largest study on CNNs in animal audio classification. Our results show that several CNNs can be fine-tuned and fused for robust and generalizable audio classification. Finally, the ensemble of CNNs is combined with handcrafted texture descriptors obtained from spectrograms for further improvement of performance. The MATLAB code used in our experiments will be provided to other researchers for future comparisons at https://github.com/LorisNanni.
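As a rough illustration of the score-level fusion of classifiers described above, the sketch below averages the per-class scores of several classifiers (the sum rule). The scores, model count, and class count are toy values, not taken from the paper, and the actual experiments were run in MATLAB rather than Python.

```python
import numpy as np

def fuse_scores(score_matrices):
    """Sum-rule fusion: average the per-class scores produced by
    several classifiers for the same set of samples."""
    stacked = np.stack(score_matrices, axis=0)  # (n_models, n_samples, n_classes)
    return stacked.mean(axis=0)

# Toy softmax scores from three hypothetical classifiers (2 samples, 3 classes).
cnn_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
cnn_b = np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]])
cnn_c = np.array([[0.6, 0.3, 0.1], [0.3, 0.3, 0.4]])

fused = fuse_scores([cnn_a, cnn_b, cnn_c])
predictions = fused.argmax(axis=1)  # predicted class per sample
```

In the same spirit, fusing CNN scores with those of classifiers trained on handcrafted descriptors only requires appending the extra score matrices to the list before averaging.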

Highlights

  • For several animal audio classification problems, we test the performance obtained by fine-tuning different CNNs pretrained on ImageNet (AlexNet, GoogLeNet, VGG-16, VGG-19, ResNet, and Inception), demonstrating that an ensemble of different fine-tuned CNNs maximizes performance on the tested animal audio classification problems.

Introduction

Sound classification has long been addressed as a pattern recognition task in many application domains. One pivotal milestone has been the popularization of graphics processing units (GPUs), devices that have made it much more feasible to train convolutional neural networks (CNNs), a powerful deep learning architecture developed by LeCun et al. [26]. The feature extraction step, for instance, has evolved to the point that many researchers view it as a form of feature engineering, the goal being to develop powerful feature vectors that describe patterns in ways relevant to the task at hand. These engineered features are commonly described in the literature as handcrafted or handmade features. The main objective behind feature engineering is to create features that place patterns belonging to the same class close to each other in the feature space.
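To make the idea of a handcrafted texture feature concrete, the sketch below computes a basic local binary pattern (LBP) histogram from a spectrogram-like matrix. LBP is one common texture descriptor for spectrograms, but the neighbourhood, bin count, and random input here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def lbp_histogram(img, bins=256):
    """Basic 3x3 local binary pattern: each interior pixel is encoded by
    thresholding its 8 neighbours against its own value, and the codes
    are summarized as a normalized histogram (the feature vector)."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    center = img[1:-1, 1:-1]
    code = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neigh >= center).astype(np.uint8) << bit
    hist, _ = np.histogram(code, bins=bins, range=(0, bins))
    return hist / hist.sum()  # normalized so descriptors are comparable

rng = np.random.default_rng(0)
spectrogram = rng.random((64, 64))  # stand-in for a real spectrogram
feat = lbp_histogram(spectrogram)
```

Patterns from the same class should yield nearby histograms, which is exactly the "close in feature space" goal described above.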
