Abstract
When automatic facial expression recognition is applied to video sequences of speaking subjects, recognition accuracy has been observed to be lower than with video sequences of still subjects. This so-called speaking effect arises during spontaneous conversations, where the speech articulation process influences facial configurations alongside the affective expressions. In this work we ask whether, in addition to facial features, other cues related to the articulation process increase emotion recognition accuracy when provided as input to a deep neural network model. We develop two neural networks that classify facial expressions of speaking subjects from the RAVDESS dataset: a spatio-temporal CNN and a GRU-cell RNN. They are first trained on facial features alone, and then on both facial features and articulation-related cues extracted from a model trained for lip reading, while also varying the number of consecutive frames provided as input. We show that, with these DNNs, adding articulation-related features increases classification accuracy by up to 12%, with greater gains when more consecutive frames are provided as input to the model.
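The recurrent variant described above can be illustrated with a minimal PyTorch sketch: per-frame facial features are concatenated with articulation-related embeddings taken from a pretrained lip-reading model, a GRU aggregates them over the consecutive frames, and a linear layer predicts one of the eight RAVDESS emotion classes. All dimensions, names, and the fusion strategy here are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GRUEmotionClassifier(nn.Module):
    """Illustrative GRU classifier over facial + articulation features (assumed sizes)."""
    def __init__(self, face_dim=136, artic_dim=256, hidden_dim=128, num_classes=8):
        super().__init__()
        # Input at each time step: facial features concatenated with an articulation embedding.
        self.gru = nn.GRU(face_dim + artic_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, face_feats, artic_feats):
        # face_feats:  (batch, frames, face_dim)   e.g. stacked facial landmark coordinates
        # artic_feats: (batch, frames, artic_dim)  e.g. embeddings from a lip-reading model
        x = torch.cat([face_feats, artic_feats], dim=-1)
        _, h_n = self.gru(x)               # h_n: (1, batch, hidden_dim)
        return self.head(h_n.squeeze(0))   # logits: (batch, num_classes)

# Example: a batch of 4 clips, each with 15 consecutive frames.
model = GRUEmotionClassifier()
logits = model(torch.randn(4, 15, 136), torch.randn(4, 15, 256))
```

Varying the number of consecutive frames, as in the experiments above, amounts to changing the length of the sequence dimension fed to the GRU.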
Highlights
In what Feldman Barrett et al [1] name the “common view”, certain emotion categories are reliably signaled or revealed by specific configurations of facial-muscle movements
To investigate both of these architectures, we develop two models, a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network (RNN), to classify the emotion expressed by a speaking subject in a video sequence (an illustrative sketch of the CNN variant follows these highlights)
We explore the problem of automatic facial expression recognition in speaking subjects
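For comparison with the GRU sketch above, the following is a purely illustrative sketch of the other variant named in the highlights: a spatio-temporal CNN that applies 3D convolutions over a stack of consecutive face crops, with clip-level articulation cues concatenated before the classification layer. Layer sizes, input shapes, and the fusion point are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatioTemporalCNN(nn.Module):
    """Illustrative 3D-CNN classifier over face crops fused with articulation cues (assumed sizes)."""
    def __init__(self, artic_dim=256, num_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # collapse (frames, H, W) into a single vector
        )
        self.head = nn.Linear(32 + artic_dim, num_classes)

    def forward(self, clips, artic_feats):
        # clips:       (batch, 3, frames, H, W)  consecutive face crops
        # artic_feats: (batch, artic_dim)        pooled lip-reading embeddings for the clip
        v = self.features(clips).flatten(1)       # (batch, 32)
        return self.head(torch.cat([v, artic_feats], dim=1))

# Example: 4 clips of 15 frames at 64x64 resolution.
model = SpatioTemporalCNN()
logits = model(torch.randn(4, 3, 15, 64, 64), torch.randn(4, 256))
```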
Summary
In what Feldman Barrett et al [1] name the “common view”, certain emotion categories are reliably signaled or revealed by specific configurations of facial-muscle movements. Whatever one's agreement with the pronouncements of this common view, it is surprising that a straightforward fact has been somewhat overlooked: the facial apparatus, aside from being used to communicate affect, is equally involved in articulation when subjects are speaking. This occurs in spontaneous interactions, and in fact even the neutral expression of a speaking subject may be confused with emotional expressions [2]. While critiques of Ekman's basic emotions theory are plentiful and alternative theories of affect exist [1], most research in automatic facial expression recognition (AFER), as well as most datasets, rests directly or indirectly on this theory and aims at classifying these (or a similar set of) discrete categories of emotion.