Abstract
Deep learning models show great potential in applications involving video-based affect recognition, including human-computer interaction, robotic interfaces, stress and depression assessment, and Alzheimer's disease detection. The low-complexity Multimodal Diverse Spatio-Temporal Network (MDSTN) is analysed for its ability to capture spatio-temporal information from audio-visual modalities for affect recognition on the Acted Facial Expressions in the Wild (AFEW) dataset. Data scarcity is handled by data-augmented parallel feature extraction in the visual network. Visual features, extracted with a carefully reviewed and customized Convolutional 3D architecture over different ranges, are combined to train a neural network for classification. Multi-resolution Cochleagram (MRCG) features from speech, along with spectral and prosodic audio features, are processed by a supervised classifier. A late fusion technique is explored to integrate the audio and video modalities, taking into account that they are processed over different temporal spans. The MDSTN approach significantly boosts the accuracy of basic emotion recognition to 71.54% on the AFEW dataset. It is particularly proficient at identifying emotions such as disgust and surprise, exceeding current benchmarks in real-world affect recognition.
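As an illustration of the late fusion step, the following is a minimal sketch that combines per-class probabilities from a video classifier and an audio classifier by weighted averaging. The abstract does not state which fusion rule MDSTN uses, so the weighted average, the function names `late_fusion` and `predict_emotion`, the `video_weight` parameter, and the seven-class label list are all assumptions made for this example and are not the authors' implementation.

```python
from typing import List, Sequence

# Assumed label set: the seven categories commonly used with AFEW.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def late_fusion(
    video_probs: Sequence[float],
    audio_probs: Sequence[float],
    video_weight: float = 0.6,  # hypothetical weight, not taken from the paper
) -> List[float]:
    """Fuse two per-class probability vectors by weighted averaging."""
    if len(video_probs) != len(audio_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")

    audio_weight = 1.0 - video_weight
    fused = [
        video_weight * v + audio_weight * a
        for v, a in zip(video_probs, audio_probs)
    ]

    # Renormalize so the fused scores sum to 1 even if the inputs do not.
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [p / total for p in fused]


def predict_emotion(
    video_probs: Sequence[float],
    audio_probs: Sequence[float],
    video_weight: float = 0.6,
) -> str:
    """Return the label with the highest fused probability."""
    fused = late_fusion(video_probs, audio_probs, video_weight)
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best_index]


# Made-up scores for one clip: video favours "disgust", audio favours "surprise".
video_scores = [0.05, 0.40, 0.05, 0.10, 0.10, 0.05, 0.25]
audio_scores = [0.10, 0.15, 0.05, 0.05, 0.10, 0.05, 0.50]
print(predict_emotion(video_scores, audio_scores, video_weight=0.6))
```

Because each modality is classified independently before fusion, the two branches can operate over different temporal spans and only need to agree on the set of output classes. In practice the fusion weight would be tuned on a validation split.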