Abstract

Automatic emotion recognition is a challenging task since emotion is communicated through different modalities. Deep Convolutional Neural Networks (DCNNs) and transfer learning have shown success in automatic emotion recognition across these modalities. However, significant improvements in accuracy are still required for practical applications. Existing methods are still not effective at modelling the temporal relationships within emotional expressions, or at identifying the salient features from different modalities and fusing them to improve accuracy. In this paper, we present an automatic emotion recognition system using audio and visual modalities. VGG19 models capture frame-level facial features, followed by a Long Short-Term Memory (LSTM) network that captures their temporal distribution at the segment level. A separate VGG19 model captures auditory features from Mel Frequency Cepstral Coefficients (MFCCs). The extracted auditory and visual features are fused, and a Deep Neural Network (DNN) with attention performs classification, with majority voting over segments. Voice Activity Detection (VAD) on the audio stream improves performance by reducing outliers during learning. The system is evaluated using Leave-One-Subject-Out (LOSO) and K-fold cross-validation, and it outperforms state-of-the-art methods on two challenging databases.
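The pipeline described above (frame-level VGG19 features pooled by an LSTM, a second VGG19 over MFCC inputs, feature fusion, and an attention-based DNN classifier) can be sketched as follows. This is a minimal illustration in PyTorch; the layer sizes, the additive-attention form, the fusion scheme, and the use of ImageNet-pretrained VGG19 weights are assumptions made for illustration and are not taken from the paper. VAD preprocessing and majority voting across segments are omitted.

```python
# Hypothetical sketch of the audio-visual emotion pipeline from the abstract.
# Dimensions, attention mechanism, and fusion are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights


class AudioVisualEmotionNet(nn.Module):
    def __init__(self, num_classes: int = 7, hidden: int = 256):
        super().__init__()
        # Visual branch: per-frame VGG19 features, then an LSTM over the segment.
        vgg_v = vgg19(weights=VGG19_Weights.IMAGENET1K_V1)
        self.visual_cnn = nn.Sequential(vgg_v.features, vgg_v.avgpool, nn.Flatten())
        self.visual_lstm = nn.LSTM(input_size=512 * 7 * 7, hidden_size=hidden,
                                   batch_first=True)
        # Audio branch: VGG19 over MFCC "images" (3-channel, spectrogram-like).
        vgg_a = vgg19(weights=VGG19_Weights.IMAGENET1K_V1)
        self.audio_cnn = nn.Sequential(vgg_a.features, vgg_a.avgpool, nn.Flatten(),
                                       nn.Linear(512 * 7 * 7, hidden))
        # Simple additive attention over the two modality vectors after fusion.
        self.attn = nn.Sequential(nn.Linear(hidden, 64), nn.Tanh(), nn.Linear(64, 1))
        # DNN classifier on the attended representation.
        self.classifier = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(),
                                        nn.Dropout(0.5), nn.Linear(128, num_classes))

    def forward(self, frames: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, 224, 224) face crops; mfcc: (B, 3, 224, 224) MFCC image.
        b, t = frames.shape[:2]
        v = self.visual_cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.visual_lstm(v)            # segment-level visual feature
        v_feat = h[-1]                             # (B, hidden)
        a_feat = self.audio_cnn(mfcc)              # (B, hidden)
        fused = torch.stack([v_feat, a_feat], 1)   # (B, 2, hidden)
        w = torch.softmax(self.attn(fused), dim=1) # attention weights per modality
        return self.classifier((w * fused).sum(1)) # class logits per segment
```

At inference time, segment-level predictions from such a model would be aggregated by majority voting across an utterance's segments, as the abstract describes.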
