Abstract

Background and Objective: Major Depressive Disorder is a highly prevalent and disabling mental health condition. Numerous studies have explored multimodal fusion systems that combine visual, audio, and textual features through deep learning architectures for clinical depression recognition. Yet, no comparative analysis of such architectures for multimodal depression recognition has been reported in the literature.

Methods: In this paper, an up-to-date literature overview of multimodal depression recognition is presented, and an extensive comparative analysis of different deep learning architectures for depression recognition is performed. First, Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architectures based on audio features are studied. Then, early-level and model-level fusion of deep audio features with visual and textual features through LSTM and CNN architectures are investigated.

Results: The performance of the proposed architectures is evaluated on the DAIC-WOZ dataset using a hold-out strategy (80% training, 10% validation, 10% test) for both binary and severity-level depression recognition. Under this strategy, the experiments demonstrate that: (1) LSTM-based audio features perform slightly better than CNN-based ones, with an accuracy of 66.25% versus 65.60% for binary depression classification; (2) model-level fusion of deep audio and visual features using an LSTM network performs best, with an accuracy of 77.16%, a precision of 53% for the depressed class, and a precision of 83% for the non-depressed class. This network also achieves a normalized Root Mean Square Error (RMSE) of 0.15 for depression severity prediction. Using a Leave-One-Subject-Out strategy, the same network achieves an accuracy of 95.38% for binary depression detection and a normalized RMSE of 0.1476 for depression severity prediction. Our best-performing architecture outperforms all state-of-the-art approaches on the DAIC-WOZ dataset.

Conclusions: The obtained results show that the proposed LSTM-based architectures surpass the CNN-based ones, as they learn representations of the temporal dynamics of multimodal features. Furthermore, model-level fusion of audio and visual features using an LSTM network leads to the best performance. Our best-performing architecture detects depression from a speech segment of less than 8 seconds with an average prediction time under 6 ms, making it suitable for real-world clinical applications.
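To illustrate the model-level fusion described above, the following is a minimal sketch (in PyTorch, which the paper does not necessarily use, and not the authors' exact architecture): each modality is encoded by its own LSTM, the final hidden states are concatenated, and a small classifier head predicts depressed versus non-depressed. All layer sizes, feature dimensions, and names here are hypothetical assumptions for illustration only.

```python
import torch
import torch.nn as nn

AUDIO_DIM, VISUAL_DIM = 40, 20   # assumed per-frame feature sizes (hypothetical)
HIDDEN = 64                      # assumed LSTM hidden size (hypothetical)

class ModelLevelFusionLSTM(nn.Module):
    """Model-level fusion: per-modality LSTM encoders, fused at the decision stage."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.audio_lstm = nn.LSTM(AUDIO_DIM, HIDDEN, batch_first=True)
        self.visual_lstm = nn.LSTM(VISUAL_DIM, HIDDEN, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * HIDDEN, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (batch, time, AUDIO_DIM); visual_seq: (batch, time, VISUAL_DIM)
        _, (h_audio, _) = self.audio_lstm(audio_seq)
        _, (h_visual, _) = self.visual_lstm(visual_seq)
        # Concatenate the last hidden state of each modality (model-level fusion)
        fused = torch.cat([h_audio[-1], h_visual[-1]], dim=-1)
        return self.classifier(fused)

# Example usage with random tensors standing in for a short multimodal segment
model = ModelLevelFusionLSTM()
audio = torch.randn(4, 250, AUDIO_DIM)
visual = torch.randn(4, 250, VISUAL_DIM)
logits = model(audio, visual)    # shape: (4, 2) class scores
```

In this kind of design, fusing the modality-specific hidden representations (rather than raw or early-concatenated features) lets each LSTM model the temporal dynamics of its own modality before the decision stage combines them.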
