Abstract

Sound classification is a broad area of research that has gained much attention in recent years. Sound classification systems based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have substantially improved recognition capability. However, their computational complexity and inadequate exploration of global dependencies in long sequences limit further gains in classification performance. In this paper, we show that there are still opportunities to improve sound classification by substituting the recurrent architecture with a parallel processing structure for feature extraction. In light of the small scale and high dimensionality of sound datasets, we propose combining multihead attention with a support vector machine (SVM) for sound classification. Multihead attention serves as the feature extractor to obtain salient features, and the SVM serves as the classifier to recognize all categories. Extensive experiments are conducted on three public datasets with distinct acoustic characteristics, UrbanSound8K, GTZAN, and IEMOCAP, using two commonly used audio spectrogram types as inputs, and we fully evaluate the impact of parameters and feature types on classification accuracy. Our results suggest that the proposed model reaches performance comparable to existing methods and demonstrates strong generalization across sound classification tasks.
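As a rough illustration of the pipeline described in the abstract, the sketch below stacks multihead self-attention layers as a feature extractor and feeds time-pooled features to an SVM. Layer sizes, the pooling choice, and the placeholder data are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch: multihead-attention feature extractor + SVM classifier.
# All sizes and the random placeholder data are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

class AttentionFeatureExtractor(nn.Module):
    """Stack of multihead self-attention encoder layers over spectrogram frames."""
    def __init__(self, n_mels=128, n_heads=8, n_layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=n_mels, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, spec):            # spec: (batch, time frames, mel bins)
        hidden = self.encoder(spec)     # contextualized frame features
        return hidden.mean(dim=1)       # mean-pool over time -> (batch, n_mels)

# Placeholder tensors standing in for precomputed spectrograms and labels
train_specs = torch.randn(32, 100, 128)
test_specs = torch.randn(8, 100, 128)
train_labels = np.random.randint(0, 10, size=32)

extractor = AttentionFeatureExtractor()
with torch.no_grad():
    train_feats = extractor(train_specs).numpy()
    test_feats = extractor(test_specs).numpy()

svm = SVC(kernel="rbf")                 # SVM on the pooled attention features
svm.fit(train_feats, train_labels)
predictions = svm.predict(test_feats)
```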

Highlights

  • As an essential medium in human-machine interactions, automatic sound classification (ASC) is a wide area of study that has been investigated for years, mostly focusing on subareas such as environmental sound classification (ESC), music genre classification (MGC), and speech emotion recognition (SER).

  • Feature 1 contributed more to the improvement in accuracy than Feature 2, regardless of the number of layers L and attention heads, indicating that mel-spectrograms effectively enhance the predictive capability of the model in ESC tasks (see the mel-spectrogram sketch after this list).

  • One reason is that Logistic Regression (LR) is a linear classification model and K-Nearest Neighbor (KNN) is not able to estimate the prediction error statistically, which leads to great fluctuations in the results. This indicates that, compared with environmental events and music genres, more kinds of features are needed to make accurate judgments about emotion. The high dimensionality of the data and the uneven distribution of data across categories are not well suited to the LR and KNN classifiers (see the classifier comparison sketch after this list).
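To illustrate the classifier comparison in the last highlight, the sketch below evaluates SVM, LR, and KNN on synthetic high-dimensional, imbalanced features; the data generator and its parameters are stand-ins for the paper's extracted features, not the actual experimental setup.

```python
# Rough comparison of the three classifiers on high-dimensional, imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 128-dimensional features, 4 unevenly distributed classes
X, y = make_classification(n_samples=500, n_features=128, n_informative=40,
                           n_classes=4, weights=[0.5, 0.25, 0.15, 0.1],
                           random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("LR", LogisticRegression(max_iter=1000)),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```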

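The second highlight refers to mel-spectrograms (Feature 1) as the more informative input. A minimal log-mel spectrogram extraction with librosa might look as follows; the test tone, frame, hop, and mel-bin settings are illustrative assumptions rather than the paper's exact values.

```python
# Illustrative log-mel spectrogram extraction with librosa.
import numpy as np
import librosa

sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(sr * 2) / sr)  # 2 s, 440 Hz test tone
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, time frames)
print(log_mel.shape)
```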

Summary

Introduction

As an essential medium in human-machine interactions, automatic sound classification (ASC) is a wide area of study that has been investigated for years, mostly focusing on subareas such as environmental sound classification (ESC), music genre classification (MGC), and speech emotion recognition (SER). Deep learning approaches are drawing increasing attention in sound classification tasks. Owing to its strong learning ability and excellent generalizability in extracting task-specific hierarchical feature representations from large quantities of training data, the deep neural network (DNN) has shown impressive performance in automatic speech recognition and music information retrieval [1, 2]. A CNN consists of interleaved convolutional and pooling layers followed by a number of fully connected layers. It shares weights by using locally connected filters, which makes the learned representation robust to translations of the input, and these convolution filters have interpretable time and frequency significance for the audio spectrogram. Mao et al. [5] utilized a CNN to learn affect-salient feature representations from local invariant features preprocessed by a sparse autoencoder for speech emotion recognition.
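As a concrete illustration of the CNN structure just described (interleaved convolution and pooling followed by fully connected layers), a minimal sketch for spectrogram input is shown below; the channel counts and the 10-class output are illustrative assumptions, not the configuration of any model in the paper.

```python
# Minimal CNN over a spectrogram: interleaved convolution/pooling, then FC layers.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # pooling halves time/frequency
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),         # fixed-size map for any input size
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):                         # x: (batch, 1, freq bins, time frames)
        return self.classifier(self.features(x))

logits = SpectrogramCNN()(torch.randn(2, 1, 128, 100))  # dummy spectrogram batch
print(logits.shape)                                      # torch.Size([2, 10])
```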

