Multimodal emotion recognition based on facial expressions, speech, and body gestures is crucial for emerging intelligent human–computer interfaces. However, it is a very difficult task, and this particular combination of modalities has rarely been studied in recent years. Using the GEMEP and Polish databases, this contribution focuses on trimodal emotion recognition from facial expressions, speech, and body gestures, covering feature extraction, feature fusion, and multimodal classification of the three modalities. In particular, for feature fusion, two novel algorithms, supervised least squares multiset kernel canonical correlation analysis (SLSMKCCA) and sparse supervised least squares multiset kernel canonical correlation analysis (SSLSMKCCA), are presented to fuse facial expression, speech, and body gesture features efficiently. Unlike traditional multiset kernel canonical correlation analysis (MKCCA) algorithms, SLSMKCCA is a supervised variant formulated in least squares form, and SSLSMKCCA combines SLSMKCCA with a sparsity term (L1 norm). Moreover, two effective solvers for SLSMKCCA and SSLSMKCCA are presented, based on alternating least squares and the augmented Lagrangian multiplier method, respectively. Extensive experimental results on the popular public GEMEP and Polish databases show that, on average, trimodal emotion recognition outperforms bimodal and monomodal recognition, and that the presented SLSMKCCA and SSLSMKCCA fusion methods both achieve very high recognition rates, with SSLSMKCCA performing best.
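To indicate the kind of objective such a fusion scheme suggests, a rough illustrative sketch is given below; the symbols (kernel matrices K_m for the three modalities, projection coefficients alpha_m, label indicator matrix Y, and weights mu and lambda) are assumptions for exposition, and the exact SLSMKCCA and SSLSMKCCA objectives are defined in the full paper rather than here. A generic supervised least-squares multiset KCCA with an L1 sparsity term could be written as

\min_{\alpha_1,\dots,\alpha_M} \; \sum_{1 \le i < j \le M} \bigl\| K_i \alpha_i - K_j \alpha_j \bigr\|_F^2 \;+\; \mu \sum_{m=1}^{M} \bigl\| Y - K_m \alpha_m \bigr\|_F^2 \;+\; \lambda \sum_{m=1}^{M} \bigl\| \alpha_m \bigr\|_1 ,

with M = 3 modalities (face, speech, gesture), where the first term enforces cross-modal agreement in least squares form, the second term injects label supervision, and the third term induces sparsity. In a formulation of this shape, setting lambda = 0 corresponds to a purely supervised least-squares variant, while lambda > 0 yields sparse projection coefficients; alternating least squares and augmented Lagrangian multiplier methods are standard solvers for the two cases, consistent with the solvers mentioned in the abstract.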