Abstract

Given the significance of human behavioral intelligence in computing devices, this work focuses on human facial expressions and speech for emotion recognition in multimodal (audio-video) signals. Audio-video signals consist of frames that capture the temporal activity of facial expressions and speech. It is challenging to determine an efficient method for constructing a spatial and temporal feature vector from frame-wise spatial feature descriptors that describes the temporal information of facial expressions and speech in audio-video signals. In this paper, an efficient temporal feature aggregation method is presented for human emotion recognition in audio-video signals. Local Binary Pattern (LBP) features of facial expressions and Mel Frequency Cepstral Coefficients (MFCCs) with their $\Delta+\Delta\Delta$ of speech are computed from each frame. An experimental analysis is performed to determine the more effective method for temporal feature aggregation, i.e., sum normalization or statistical functions, for constructing the spatial and temporal feature vector. A multiclass Support Vector Machine (SVM) classification model is trained and tested to evaluate the performance of the temporal feature aggregation method with the LBP features and the MFCCs with their $\Delta+\Delta\Delta$. The Bayesian optimization (BO) method determines the optimal hyper-parameters of the multiclass SVM classifier for emotion detection. The experimental analysis of the proposed work is performed on the publicly accessible and challenging Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) and compared with existing work.
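The following is a minimal sketch of the aggregation and classification steps outlined above, assuming frame-wise LBP histograms and MFCC $\Delta+\Delta\Delta$ matrices have already been extracted per clip. The feature dimensions, the random toy data, and the use of scikit-optimize's BayesSearchCV as the Bayesian optimization tool are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: aggregate frame-wise features over time, then tune a
# multiclass SVM with Bayesian optimization. Toy data stands in for CREMA-D.
import numpy as np
from sklearn.svm import SVC
from skopt import BayesSearchCV  # scikit-optimize; one possible BO tool

def aggregate_statistical(frame_feats):
    """Collapse a (n_frames, n_dims) matrix into one fixed-length vector
    by concatenating per-dimension statistics computed over time."""
    return np.concatenate([
        frame_feats.mean(axis=0),
        frame_feats.std(axis=0),
        frame_feats.min(axis=0),
        frame_feats.max(axis=0),
    ])

def aggregate_sum_normalized(frame_feats):
    """Alternative aggregation: sum the frames, then L1-normalize."""
    s = frame_feats.sum(axis=0)
    return s / (np.abs(s).sum() + 1e-12)

# Toy stand-ins: 59-dim LBP histograms (video frames) and 39-dim
# MFCC + delta + delta-delta vectors (audio frames); dimensions are assumed.
rng = np.random.default_rng(0)
n_clips, n_classes = 60, 6
X = np.vstack([
    np.concatenate([
        aggregate_statistical(rng.random((30, 59))),  # LBP frames
        aggregate_statistical(rng.random((30, 39))),  # MFCC + deltas
    ])
    for _ in range(n_clips)
])
y = np.repeat(np.arange(n_classes), n_clips // n_classes)  # 6 emotion labels

# Bayesian optimization over the SVM hyper-parameters C and gamma.
search = BayesSearchCV(
    SVC(kernel="rbf"),
    {"C": (1e-2, 1e3, "log-uniform"), "gamma": (1e-4, 1e1, "log-uniform")},
    n_iter=16, cv=3, random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_)
```

The two aggregation choices trade off differently: statistical functions retain per-dimension temporal variation (spread and extremes as well as the mean), while sum normalization collapses the frames into a single normalized profile of the same dimensionality as one frame.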
