Abstract

Peak frame selection, together with identification of the corresponding voiced segment, is a challenging problem in audio-visual human emotion recognition. The peak frame is the most relevant descriptor of a facial expression and can be inferred across varied emotional states. In this paper, an improved Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is proposed to select the peak frame based on the co-occurrence behavior of facial action units in visual sequences. The proposed method incorporates expert judgments when identifying the peak frame in the video modality. It then locates the peak voiced segment in the audio modality using synchronous and asynchronous temporal relationships with the selected peak visual frame. The facial action unit features of the peak frame are fused with nine statistical characteristics of the spectral features of the voiced segment. A weighted product rule-based decision-level fusion combines the posterior probabilities of two independent support vector machine (SVM) classification models, one for audio and one for video. The performance of the proposed peak frame and voiced segment selection method is evaluated against the existing Maximum-Dissimilarity (MAX-DIST), Dendrogram-Clustering (DEND-CLUSTER), and Emotion Intensity (EIFS) based peak frame selection methods on two challenging emotion datasets in two different languages: eNTERFACE'05 in English and BAUM-1a in Turkish. The results show that the system with the proposed method outperforms the existing techniques, achieving emotion recognition accuracies of 88.03% and 84.61% on the eNTERFACE'05 and BAUM-1a datasets, respectively.
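To make the frame-ranking step concrete, the sketch below shows a standard TOPSIS ranking in Python. It assumes each frame is described by a vector of benefit criteria (e.g., action-unit intensity or co-occurrence scores) and that expert judgments enter as criterion weights; the function name, the toy decision matrix, and the example weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def topsis_rank(decision_matrix, weights):
    """Rank alternatives (rows) by TOPSIS closeness to the ideal solution.

    decision_matrix : (n_frames, n_criteria) array of benefit criteria,
                      e.g. per-frame action-unit scores (hypothetical here).
    weights         : (n_criteria,) expert-judgment weights summing to 1.
    Returns the closeness coefficient of each frame (higher is better).
    """
    X = np.asarray(decision_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)

    # 1. Vector-normalize each criterion column.
    norms = np.linalg.norm(X, axis=0)
    R = X / np.where(norms == 0, 1.0, norms)

    # 2. Apply the expert weights.
    V = R * w

    # 3. Positive and negative ideal solutions (all criteria treated as benefits).
    v_pos, v_neg = V.max(axis=0), V.min(axis=0)

    # 4. Euclidean separation of each frame from both ideals.
    s_pos = np.linalg.norm(V - v_pos, axis=1)
    s_neg = np.linalg.norm(V - v_neg, axis=1)

    # 5. Relative closeness; the peak frame maximizes this coefficient.
    return s_neg / (s_pos + s_neg + 1e-12)

# Toy example: 4 frames scored on 3 hypothetical AU criteria.
scores = topsis_rank([[0.2, 0.5, 0.1],
                      [0.8, 0.7, 0.6],
                      [0.4, 0.9, 0.3],
                      [0.6, 0.2, 0.5]],
                     weights=[0.5, 0.3, 0.2])
peak_frame = int(np.argmax(scores))  # index of the selected peak frame
```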

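The decision-level fusion step can be sketched in the same spirit. Assuming each modality's SVM outputs calibrated class posteriors, the weighted product rule combines them as below; the modality weight `alpha` and the toy posterior values are illustrative placeholders, not parameters reported in the paper.

```python
import numpy as np

def weighted_product_fusion(p_audio, p_video, alpha=0.4):
    """Fuse per-class posteriors from the audio and video SVMs with the
    weighted product rule: P(c) proportional to p_audio(c)**alpha * p_video(c)**(1 - alpha).

    p_audio, p_video : (n_classes,) posterior probability vectors.
    alpha            : audio-modality weight in [0, 1] (illustrative value).
    """
    fused = (np.asarray(p_audio) ** alpha) * (np.asarray(p_video) ** (1 - alpha))
    return fused / fused.sum()  # renormalize to a probability distribution

# Toy posteriors over six emotion classes (values are made up).
p_a = [0.10, 0.05, 0.15, 0.50, 0.10, 0.10]
p_v = [0.05, 0.05, 0.10, 0.60, 0.10, 0.10]
fused = weighted_product_fusion(p_a, p_v)
prediction = int(fused.argmax())  # class with the highest fused posterior
```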