Abstract
In spontaneous face-to-face communication, emotion expressions are often mixed with the expressions produced by speech articulation, which makes emotion recognition difficult. This article introduces methods for reducing the influence of the utterance on the visual parameters used in audio-visual emotion recognition. The audio and visual channels are first combined under a Multi-stream Hidden Markov Model (MHMM). Utterance reduction is then achieved by computing the residual between the observed visual parameters and the utterance-related visual parameters predicted from the audio. To obtain this prediction, we introduce a Fused Hidden Markov Model inversion method trained on a neutrally expressed audio-visual corpus. To reduce computational complexity, the inversion model is further simplified to a Gaussian Mixture Model (GMM) mapping. Compared with traditional bimodal emotion recognition methods (e.g., SVM, CART, Boosting), the utterance reduction method gives better emotion recognition results. The experiments also show the effectiveness of our emotion recognition system when used in a live environment.
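To make the fusion step concrete, the following is a minimal sketch in Python (using the hmmlearn library) of a stream-weighted combination in the spirit of the MHMM described above. The emotion label set, stream weights, HMM sizes, and feature arrays are illustrative assumptions, and the sketch combines per-stream log-likelihoods at the score level rather than performing the paper's state-synchronous multi-stream fusion.

import numpy as np
from hmmlearn import hmm

# Illustrative sketch only: one HMM per emotion and per stream, combined with
# fixed stream weights. The paper's MHMM fuses the streams at the state level;
# this simplified version combines the per-stream log-likelihoods instead.
EMOTIONS = ["neutral", "happy", "angry", "sad"]   # assumed label set
W_AUDIO, W_VISUAL = 0.6, 0.4                      # assumed stream weights

def train_models(audio_feats, visual_feats):
    # audio_feats / visual_feats: dict mapping emotion -> (n_frames, n_dims) array
    models = {}
    for emo in EMOTIONS:
        m_a = hmm.GaussianHMM(n_components=3).fit(audio_feats[emo])
        m_v = hmm.GaussianHMM(n_components=3).fit(visual_feats[emo])
        models[emo] = (m_a, m_v)
    return models

def classify(models, o_a, o_v):
    # Pick the emotion whose stream-weighted log-likelihood is highest.
    scores = {emo: W_AUDIO * m_a.score(o_a) + W_VISUAL * m_v.score(o_v)
              for emo, (m_a, m_v) in models.items()}
    return max(scores, key=scores.get)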
Highlights
The last two decades have seen significant effort devoted to developing methods for automatic human emotion recognition (e.g., [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]), which is an attractive research issue due to its great potential in human-computer interaction (HCI), virtual reality, etc.
Most existing work still combines the audio-visual parameters for emotion recognition using feature-level or decision-level fusion models (e.g., [19,20,45,46,47]), and some of it focuses only on extracting Action Units (AUs) from facial expressions rather than on emotion recognition (e.g., [36,37,38,39,41,42,43,44,48,49])
(b) We propose an utterance-independent method that enhances the visual expression parameters for emotion recognition in spontaneous communication by combining the Multi-stream Hidden Markov Model (MHMM) with fused Hidden Markov Model (HMM) inversion
Summary
The last two decades have seen significant effort devoted to developing methods for automatic human emotion recognition (e.g., [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]), which is an attractive research issue due to its great potential in human-computer interaction (HCI), virtual reality, etc. (b) We propose an utterance-independent method that enhances the visual expression parameters for emotion recognition in spontaneous communication by combining the MHMM with fused HMM inversion. We extend this work by introducing a Baum-Welch HMM inversion method for multi-stream HMMs. As shown by Choi and Hwang [54], Xie and Liu [56], and Moon and Hwang [57], given an audio input, the optimal visual counterpart $O^v$ can be formulated as the maximization of the objective function $L(O^v) = \log P(O^a, O^v \mid \lambda_{av})$, where $O^a$ denotes the audio features and $\lambda_{av}$ denotes the parameters of the fused HMM model. The GMM conversion reduces the computational complexity compared with the inversion method, but it weakens the time-series analysis in the audio-visual processing by replacing HMM states with real audio observations. Here $\Sigma^A_k$ is the covariance matrix in the audio vector space, and $p_k(O^a)$ is the probability that the given audio observation belongs to the $k$-th mixture component (Figure 3)
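The mapping equation itself is not reproduced in this summary, but a standard GMM-based audio-to-visual mapping consistent with the quantities above (the audio-space covariance $\Sigma^A_k$ and the component posterior $p_k(O^a)$) can be sketched as follows: it estimates the utterance-related visual parameters as the conditional expectation of $O^v$ given $O^a$ under a joint GMM, after which the utterance reduction takes the residual between the observed and predicted visual parameters. The feature dimensions, component count, and placeholder training data below are assumptions, not values from the paper.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Assumed dimensions and data; real inputs would be time-aligned audio/visual
# frames extracted from a neutrally expressed corpus.
d_a, d_v, n_comp = 12, 6, 8
audio_train = np.random.randn(2000, d_a)    # placeholder audio features
visual_train = np.random.randn(2000, d_v)   # placeholder visual parameters

# Fit a joint GMM on concatenated [audio, visual] vectors.
gmm = GaussianMixture(n_components=n_comp, covariance_type="full")
gmm.fit(np.hstack([audio_train, visual_train]))

def predict_visual(o_a):
    # Conditional expectation E[O^v | O^a] under the joint GMM.
    mu_a, mu_v = gmm.means_[:, :d_a], gmm.means_[:, d_a:]
    cov_aa = gmm.covariances_[:, :d_a, :d_a]   # Sigma^A_k: audio-space covariance
    cov_va = gmm.covariances_[:, d_a:, :d_a]   # visual-audio cross covariance
    # p_k(O^a): posterior probability of each mixture component given the audio frame.
    lik = np.array([gmm.weights_[k] * multivariate_normal.pdf(o_a, mu_a[k], cov_aa[k])
                    for k in range(n_comp)])
    post = lik / lik.sum()
    est = np.zeros(d_v)
    for k in range(n_comp):
        est += post[k] * (mu_v[k] + cov_va[k] @ np.linalg.solve(cov_aa[k], o_a - mu_a[k]))
    return est

# Utterance reduction: residual between observed and utterance-related visual parameters.
residual = visual_train[0] - predict_visual(audio_train[0])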