Abstract

Recent developments in audio-visual emotion recognition (AVER) have shown the importance of integrating visual components into the speech recognition process to improve robustness. Visual characteristics have strong potential to boost the accuracy of current speech recognition techniques and have become increasingly important when modelling speech recognizers. Convolutional neural networks (CNNs) are particularly well suited to images, and an audio file can be converted into an image, such as a spectrogram, from which hidden knowledge in the frequency content can be extracted. This paper presents a method for emotional expression recognition using spectrograms and a two-dimensional CNN (CNN-2D). Spectrograms formed from the speech signals serve as the input to the CNN-2D. The proposed model consists of three kinds of layers, namely convolution layers, pooling layers and fully connected layers, which extract discriminative characteristics from the spectrogram representations and estimate probabilities for the seven emotions. This article compares the output with an existing SER approach that uses audio files and a CNN; accuracy improves by 6.5% when CNN-2D is used.
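The abstract describes the pipeline only at a high level (convolution, pooling and fully connected layers producing scores for seven emotions); the paper does not give kernel sizes or filter counts here. The following is a minimal NumPy sketch of a single forward pass through that kind of CNN-2D, with an arbitrary toy configuration (four 3×3 kernels, 2×2 pooling, a 32×32 input standing in for a real spectrogram image) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; trims edges that do not fit."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 32x32 array standing in for a spectrogram image (hypothetical size)
spec = rng.standard_normal((32, 32))

# Convolution layer (4 random 3x3 kernels), ReLU, then 2x2 max pooling
feature_maps = [np.maximum(conv2d(spec, rng.standard_normal((3, 3))), 0)
                for _ in range(4)]
pooled = [max_pool(f) for f in feature_maps]

# Fully connected layer mapping the flattened features to 7 emotion scores
flat = np.concatenate([p.ravel() for p in pooled])
W = rng.standard_normal((7, flat.size)) * 0.01
b = np.zeros(7)
probs = softmax(W @ flat + b)
print(probs.shape, probs.sum())
```

With random weights the probabilities are meaningless; in practice the kernels, `W` and `b` would be learned from labelled spectrograms, and a framework such as Keras or PyTorch would replace these hand-written loops.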

Highlights

  • Our eyes are the most appropriate place to look

  • To implement CNN-2D, we converted the whole dataset from wav files into image files

  • This work is part of audio-visual emotion recognition


Introduction

If eyes were hypothetically wiser and quicker than ears, wouldn't it be more useful for our eyes to send sound signals for processing? [1][2] Nowadays, artificial intelligence and neural networks [3][8] can be implemented using different software such as Python, Jupyter and Anaconda, which are well suited to image analytics. To implement CNN-2D, we converted the whole dataset from wav files into image files. This work is part of audio-visual emotion recognition.
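The wav-to-image conversion mentioned above can be sketched as follows. This is a minimal illustration using `scipy.signal.spectrogram`, with a synthetic 440 Hz tone standing in for a real wav file; the paper does not specify its exact spectrogram parameters, so the window length here (`nperseg=256`) is an assumption:

```python
import numpy as np
from scipy.signal import spectrogram

def wav_to_spectrogram(samples, sample_rate, nperseg=256):
    """Convert a 1-D audio signal into a log-power spectrogram array.

    The resulting 2-D array can be rendered and saved as an image file
    (e.g. a PNG) and then fed to a 2-D CNN.
    """
    freqs, times, power = spectrogram(samples, fs=sample_rate, nperseg=nperseg)
    # Log scaling compresses the dynamic range, as in typical spectrogram images
    log_power = 10 * np.log10(power + 1e-10)
    return freqs, times, log_power

# Example: a 1-second 440 Hz tone at 16 kHz (stand-in for a loaded wav file)
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
freqs, times, img = wav_to_spectrogram(tone, sr)
print(img.shape)  # (frequency bins, time frames)
```

For a real dataset, the wav samples would come from a reader such as `scipy.io.wavfile.read`, and each `img` array would be written out as an image before training the CNN-2D.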

