Deep learning models show great potential in video-based affect recognition, with applications including human-computer interaction, robotic interfaces, stress and depression assessment, and Alzheimer's disease detection. The low-complexity Multimodal Diverse Spatio-Temporal Network (MDSTN) is analysed for its ability to capture spatio-temporal information from audio-visual modalities for affect recognition on the Acted Facial Expressions in the Wild (AFEW) dataset. Data scarcity is handled by data-augmented parallel feature extraction in the visual network. Visual features, extracted with a carefully reviewed and customized Convolutional 3D architecture over different ranges, are combined to train a neural network for classification. Multi-resolution Cochleagram (MRCG) features from speech, along with spectral and prosodic audio features, are processed by a supervised classifier. Late fusion is explored to integrate the audio and video modalities, since the two are processed over different temporal spans. The MDSTN approach raises basic emotion recognition accuracy on the AFEW dataset to 71.54%. It performs particularly well on emotions such as disgust and surprise, exceeding current benchmarks in real-world affect recognition.
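The late fusion step mentioned above can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the weighted average of per-modality class probabilities used here is an assumption, as are the function name `late_fusion`, the `video_weight` parameter and its default value, and the example probabilities. The seven labels are the emotion categories commonly used with AFEW.

```python
from typing import List, Sequence, Tuple

# Emotion categories commonly used with AFEW (assumed label order, for illustration only).
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def late_fusion(
    video_probs: Sequence[float],
    audio_probs: Sequence[float],
    video_weight: float = 0.6,  # hypothetical weight, not taken from the paper
) -> Tuple[List[float], int]:
    """Fuse per-class probabilities from two independently trained classifiers.

    Each modality is classified on its own; only the output scores are
    combined, which is what makes this a late (decision-level) fusion.
    """
    if len(video_probs) != len(audio_probs):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")

    audio_weight = 1.0 - video_weight
    fused = [
        video_weight * v + audio_weight * a
        for v, a in zip(video_probs, audio_probs)
    ]
    # Index of the class with the highest fused score.
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best


# Made-up scores: the video branch favours "disgust", the audio branch "surprise".
video = [0.10, 0.50, 0.10, 0.10, 0.10, 0.05, 0.05]
audio = [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.40]

fused, best = late_fusion(video, audio)
print(EMOTIONS[best])  # prints "disgust"
```

Because the two branches only meet at the score level, each can operate on its own temporal span, and the weight can be tuned on a validation set without retraining either classifier.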