Abstract

AI-based approaches, especially deep learning, have made remarkable achievements in Speech Emotion Recognition (SER), and Convolutional Neural Networks (CNNs) have been the backbone of many of these solutions. Although the use of CNNs has resulted in high-performing models, building them requires domain knowledge and direct human intervention; the same issue arises when improving an existing model. To address this problem, we adopt techniques first introduced in Neural Architecture Search (NAS) and use a genetic process to search for models with improved accuracy. More specifically, we insert blocks with dynamic structures between the layers of an existing model and then apply genetic operations (i.e., selection, mutation, and crossover) to find the best-performing structures. To validate our method, we use this algorithm to improve architectures by searching on the Berlin Database of Emotional Speech (EMODB). The experimental results show at least a 1.7% improvement in accuracy on the EMODB test set.
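
The sketch below illustrates the kind of genetic search loop the abstract describes: candidate blocks are encoded as operation lists, and selection, crossover, and mutation are applied across generations. All names and hyperparameters here (BLOCK_OPS, BLOCK_LEN, POP_SIZE, the placeholder evaluate_accuracy) are illustrative assumptions, not the paper's actual implementation; in the real method, fitness would come from inserting each block into the base CNN and measuring accuracy on EMODB.

```python
# Minimal sketch of a genetic search over inserted blocks (assumed encoding).
import random

BLOCK_OPS = ["conv3x3", "conv5x5", "depthwise_conv", "identity", "avg_pool"]
BLOCK_LEN = 3        # number of operations per inserted block (assumed)
POP_SIZE = 10
GENERATIONS = 20
MUTATION_RATE = 0.2

def random_block():
    """A candidate block is encoded as a list of operation names."""
    return [random.choice(BLOCK_OPS) for _ in range(BLOCK_LEN)]

def evaluate_accuracy(block):
    """Placeholder fitness: the real method would insert the block into the
    base CNN, train/fine-tune on EMODB, and return validation accuracy."""
    return random.random()  # stand-in for actual training and evaluation

def select(population, fitnesses, k=2):
    """Tournament selection: return the fitter of k random candidates."""
    contenders = random.sample(list(zip(population, fitnesses)), k)
    return max(contenders, key=lambda pair: pair[1])[0]

def crossover(parent_a, parent_b):
    """Single-point crossover of two block encodings."""
    point = random.randint(1, BLOCK_LEN - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(block):
    """Randomly replace each operation with probability MUTATION_RATE."""
    return [random.choice(BLOCK_OPS) if random.random() < MUTATION_RATE else op
            for op in block]

population = [random_block() for _ in range(POP_SIZE)]
for generation in range(GENERATIONS):
    fitnesses = [evaluate_accuracy(b) for b in population]
    population = [mutate(crossover(select(population, fitnesses),
                                   select(population, fitnesses)))
                  for _ in range(POP_SIZE)]

best = max(population, key=evaluate_accuracy)
print("Best block structure found:", best)
```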
