Abstract

People generally perceive other people's emotions from speech and facial expressions, so it can be helpful to use speech signals and facial images simultaneously. However, because the characteristics of speech and image data differ, combining the two inputs remains a challenging issue in emotion-recognition research. In this paper, we propose a method to recognize emotions by synchronizing speech signals and image sequences. We design three deep networks. The first network is trained on image sequences, focusing on facial expression changes. Facial landmarks are input to the second network to reflect facial motion. The speech signals are first converted to acoustic features, which serve as the input of the third network, synchronized with the image sequence. These three networks are combined using a novel integration method to boost the performance of emotion recognition. An accuracy comparison is conducted to verify the proposed method. The results demonstrate that the proposed method achieves more accurate performance than previous studies.
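
As an illustration of the three-branch design described above, the following is a minimal PyTorch sketch of late fusion across an image-sequence branch, a landmark branch, and a synchronized acoustic-feature branch. All layer sizes, branch architectures, and the learned fusion weights are illustrative assumptions, not the paper's exact networks or integration method.

```python
import torch
import torch.nn as nn

class ThreeBranchEmotionNet(nn.Module):
    """Hypothetical sketch: three branches fused by learned softmax weights.
    Branch architectures and sizes are illustrative, not the paper's design."""

    def __init__(self, num_emotions=7, landmark_dim=68 * 2, acoustic_dim=40):
        super().__init__()
        # Branch 1: image sequence (B, T, C, H, W) -> per-frame CNN features, averaged over time.
        self.image_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),  # -> (B*T, 64)
        )
        # Branch 2: facial landmark trajectories (B, T, landmark_dim) -> GRU.
        self.landmark_rnn = nn.GRU(landmark_dim, 64, batch_first=True)
        # Branch 3: acoustic features synchronized to the frames (B, T, acoustic_dim) -> GRU.
        self.audio_rnn = nn.GRU(acoustic_dim, 64, batch_first=True)
        # One classifier head per branch.
        self.heads = nn.ModuleList([nn.Linear(64, num_emotions) for _ in range(3)])
        # Learnable integration weights over the three branch predictions (assumption).
        self.fusion_logits = nn.Parameter(torch.zeros(3))

    def forward(self, images, landmarks, acoustics):
        b, t = images.shape[:2]
        img_feat = self.image_cnn(images.flatten(0, 1)).view(b, t, -1).mean(dim=1)
        _, lm_h = self.landmark_rnn(landmarks)
        _, au_h = self.audio_rnn(acoustics)
        logits = [head(f) for head, f in zip(self.heads, (img_feat, lm_h[-1], au_h[-1]))]
        w = torch.softmax(self.fusion_logits, dim=0)  # weighted integration of branch outputs
        return sum(wi * li for wi, li in zip(w, logits))

# Example forward pass with random tensors (batch of 2, 16 synchronized frames).
net = ThreeBranchEmotionNet()
out = net(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 16, 136), torch.randn(2, 16, 40))
print(out.shape)  # torch.Size([2, 7])
```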

Highlights

  • High-performance personal computers have rapidly become widespread with the technological development of the information society

  • The speech signals are first converted to acoustic features, which serve as the input of the third network, synchronized with the image sequence

  • Many emotion-recognition systems need to determine whether a given speech signal and image sequence belong to the acting section or the silence section (a minimal sketch of this check follows the list)
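
To illustrate the acting/silence split mentioned in the last highlight, below is a minimal sketch of a short-time energy gate over audio aligned to the video frames. The threshold, frame rate, and function name are assumptions for illustration, not the paper's actual segmentation method.

```python
import numpy as np

def label_acting_sections(audio, sample_rate=16000, fps=30, energy_thresh_db=-40.0):
    """Hypothetical sketch: mark each video frame as 'acting' (speech present)
    or 'silence' using the RMS energy of the audio samples aligned to it."""
    hop = sample_rate // fps                      # audio samples per video frame
    n_frames = len(audio) // hop
    labels = []
    for i in range(n_frames):
        chunk = audio[i * hop:(i + 1) * hop]
        rms = np.sqrt(np.mean(chunk ** 2) + 1e-12)
        db = 20.0 * np.log10(rms + 1e-12)         # energy in dB (full scale = 0 dB)
        labels.append(db > energy_thresh_db)      # True -> acting, False -> silence
    return np.array(labels)

# Example: half a second of silence followed by half a second of noise, at 30 fps.
audio = np.concatenate([np.zeros(8000), 0.5 * np.random.randn(8000)])
print(label_acting_sections(audio).astype(int))   # mostly 0s, then mostly 1s
```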

Introduction

High-performance personal computers have rapidly become widespread with the technological development of the information society. The convolutional neural network (CNN) is the most popular among deep-learning models: it convolves input images through many filters and automatically produces feature maps. Various studies have combined facial features with deep-learning-based models to boost the performance of facial expression recognition [24, 38, 46]. However, because the characteristics of speech signals and image sequences are different, combining the two inputs is still a challenging issue in emotion-recognition research. We therefore propose a method to recognize emotions by synchronizing speech signals and image sequences: the speech signals are first converted to acoustic features, which serve as the input of the audio network, synchronized with the image sequence.
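
The following is a minimal sketch of that synchronization step, assuming a 30 fps image sequence and MFCC acoustic features extracted with librosa. The sampling rate, frame rate, and feature dimensionality are illustrative assumptions rather than the paper's exact settings.

```python
import librosa

def synchronized_mfcc(wav_path, fps=30, sr=16000, n_mfcc=40):
    """Hypothetical sketch: extract roughly one MFCC vector per video frame so
    the acoustic features line up with the image sequence."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = sr // fps                               # audio samples per video frame
    # One MFCC column per hop of audio -> approximately one column per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T                                 # shape: (num_frames, n_mfcc)

# Usage (path is a placeholder): features[i] pairs with video frame i.
# features = synchronized_mfcc("clip.wav")
# print(features.shape)
```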

The remainder of the paper is organized as follows:

  • Facial emotion recognition
  • Audio emotion recognition
  • Multimodal emotion recognition
  • Preprocessing
  • Image-based model
  • Weighted joint fine-tuning
  • Baselines
  • Feature concatenation
  • Joint fine-tuning
  • Results
  • Integration method
  • Conclusions