Abstract

In this work, we propose STLEV (Siamese Neural Network with Triplet Loss and Long Short-Term Memory for Emotion recognition in Videos), a method for emotion recognition in videos. Such a method is beneficial for cognitive human-computer interaction. Emotion recognition from video is challenging because only a few frames of a video contain information relevant to recognizing emotions, and identifying the apex frame is itself difficult. The task involves extracting features from the frames of a video and classifying the resulting feature sequence. For feature extraction, we use a Siamese Neural Network (SNN), a metric-based meta-learning model, trained with triplet loss. For classification, a Long Short-Term Memory (LSTM) model is used. The method thus exploits both the spatial and the temporal aspects of video: spatial features are extracted by the SNN and temporal dynamics are modeled by the LSTM. Deep learning models typically need large amounts of data for good accuracy; we address this by leveraging the few-shot learning properties of metric-based meta-learning, so that fewer samples are needed to train the feature-extraction model. Experiments on the BU-4DFE dataset achieved an accuracy of 87.5%, demonstrating the effectiveness of STLEV.
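The triplet loss mentioned above trains the SNN so that embeddings of frames with the same emotion label sit closer together than embeddings of frames with different labels, by at least a margin. A minimal sketch of that loss in plain Python is given below; the embedding vectors, the margin value, and the function name are illustrative assumptions, not taken from the paper:

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: the anchor should be closer to the
    positive (same emotion class) than to the negative (different
    class) by at least `margin`; otherwise a penalty is incurred."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 3-D embeddings (hypothetical values for illustration only).
anchor   = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]   # same emotion class as the anchor
easy_neg = [0.0, 1.0, 0.0]   # far away: margin satisfied, loss is 0
hard_neg = [0.8, 0.2, 0.0]   # too close to the anchor: positive loss

print(triplet_loss(anchor, positive, easy_neg))
print(triplet_loss(anchor, positive, hard_neg))
```

During training, minimizing this loss over many sampled triplets pulls same-class frame embeddings together and pushes different-class embeddings apart, which is what makes the SNN features usable with few training samples.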
