Abstract
People generally perceive other people’s emotions from speech and facial expressions, so using speech signals and facial images simultaneously can be helpful. However, because speech and image data have different characteristics, combining the two inputs remains a challenging issue in emotion-recognition research. In this paper, we propose a method to recognize emotions by synchronizing speech signals and image sequences. We design three deep networks. One network is trained on image sequences, focusing on facial expression changes. Facial landmarks are input to another network to reflect facial motion. The speech signals are first converted to acoustic features, which serve as the input of the third network, synchronized with the image sequence. These three networks are combined using a novel integration method to boost the performance of emotion recognition. A comparative accuracy test verifies the proposed method. The results demonstrate that the proposed method performs more accurately than previous studies.
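The abstract does not detail the integration method, so as a rough illustration of how the three networks' outputs could be combined, here is a minimal late-fusion sketch: per-class probability vectors from each network are merged by a weighted average. The function name, the weights, and the averaging scheme are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def late_fusion(probs_list, weights=None):
    """Combine class-probability vectors from several networks.

    probs_list: list of 1-D arrays, one per network, each summing to 1.
    weights: optional per-network weights (hypothetical; uniform by default).
    """
    probs = np.stack([np.asarray(p, dtype=float) for p in probs_list])
    if weights is None:
        weights = np.ones(len(probs_list)) / len(probs_list)
    # Weighted average across networks yields the fused class distribution.
    return np.average(probs, axis=0, weights=weights)

# Example: three networks voting over two emotion classes.
fused = late_fusion([[0.7, 0.3], [0.5, 0.5], [0.6, 0.4]])
print(fused)  # [0.6 0.4]
```

A weighted average (rather than, say, majority voting) keeps the fused output a valid probability distribution and lets a stronger modality be up-weighted.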
Highlights
High-performance personal computers have become widespread with the technological development of the information society
The speech signals are first converted to acoustic features, which serve as the input of another network, synchronized with the image sequence
Many emotion-recognition systems need to determine whether a given speech signal and image sequence belongs to the acting section or the silence section
Summary
High-performance personal computers have become widespread with the technological development of the information society. The convolutional neural network (CNN) is the most popular among deep-learning models. It convolves input images through many filters and automatically produces a feature map. Various studies have combined facial features with deep-learning-based models to boost the performance of facial expression recognition [24, 38, 46]. Because the characteristics of speech signals and image sequences are different, combining the two inputs is still a challenging issue in emotion-recognition research. We propose a method to recognize emotions by synchronizing speech signals and image sequences. The speech signals are first converted to acoustic features, which serve as the input of another network, synchronized with the image sequence.
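The summary's description of a CNN, sliding a filter over an input image to produce a feature map, can be sketched with a bare-bones 2-D convolution. This is a generic illustration of the convolution operation itself (valid padding, single filter), not the paper's network; the filter values are an assumption chosen to respond to vertical edges.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a filter over the image and return the feature map (valid padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Each output cell is the elementwise product of the filter
            # with the image patch under it, summed up.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_filter = np.array([[1.0, -1.0],
                        [1.0, -1.0]])  # illustrative vertical-edge filter
feature_map = conv2d(image, edge_filter)
print(feature_map.shape)  # (3, 3)
```

A CNN learns the filter values during training and stacks many such feature maps per layer; the sliding-window computation above is the core operation each filter performs.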