ABSTRACT
Depression, a prevalent mental disorder in modern society, significantly impacts people's daily lives. Recently, automated diagnosis models for detecting depression have advanced considerably. However, data scarcity, driven primarily by privacy concerns, remains a challenge: traditional speech features are limited in the diagnostic knowledge they can represent, and deep learning algorithms require substantial data. Furthermore, existing neural-network-based multimodal methods overlook the heterogeneity gap between modalities, which can introduce redundant information. To address these issues, we propose a multimodal depression detection model based on an Enhanced Cross-Attention (ECA) mechanism. The model effectively explores text-speech interactions while accounting for modality heterogeneity, and data scarcity is mitigated by fine-tuning pre-trained models. We design a modal fusion module based on ECA that emphasizes similarity responses and updates the weight of each modal feature according to the similarity information between modal features. For speech feature extraction, we reduce the model's computational complexity by integrating a multi-window self-attention mechanism with the Fourier transform. The proposed model is evaluated on the public DAIC-WOZ dataset, achieving an accuracy of 80.0% and an average F1-score improvement of 4.3% over related methods.
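To make the similarity-weighted fusion idea concrete, the following PyTorch sketch shows one plausible reading of an ECA-style module: cross-attention from text queries to speech keys/values, with a gate derived from the cross-modal similarity response re-weighting each modality's contribution. All class names, dimensions, and the sigmoid gating scheme are illustrative assumptions; the abstract does not specify the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityWeightedFusion(nn.Module):
    """Hypothetical ECA-style fusion: weights each modality's features by the
    cosine-similarity response between text and speech tokens. A sketch of the
    general technique, not the paper's implementation."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_text = nn.Linear(dim, dim)
        self.k_speech = nn.Linear(dim, dim)
        self.v_speech = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, text: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        # text: (B, Lt, D) token features; speech: (B, Ls, D) frame features.
        q = self.q_text(text)        # (B, Lt, D)
        k = self.k_speech(speech)    # (B, Ls, D)
        v = self.v_speech(speech)    # (B, Ls, D)

        # Cross-modal similarity response: cosine similarity between every
        # text token and every speech frame -> (B, Lt, Ls).
        sim = F.cosine_similarity(q.unsqueeze(2), k.unsqueeze(1), dim=-1)

        # Attend over speech frames with the similarity map as attention logits.
        attn = sim.softmax(dim=-1)
        speech_ctx = attn @ v        # (B, Lt, D) speech context per text token

        # Update modality weights from the mean similarity response: tokens with
        # strong cross-modal agreement lean on text, weak ones on speech context.
        gate = torch.sigmoid(sim.mean(dim=-1, keepdim=True))  # (B, Lt, 1)
        fused = torch.cat([gate * text, (1.0 - gate) * speech_ctx], dim=-1)
        return self.out(fused)       # (B, Lt, D)

# Minimal usage check with random inputs (batch 2, 50 text tokens, 120 frames).
if __name__ == "__main__":
    fusion = SimilarityWeightedFusion(dim=256)
    out = fusion(torch.randn(2, 50, 256), torch.randn(2, 120, 256))
    print(out.shape)  # torch.Size([2, 50, 256])
```

Using the similarity map both as attention logits and as a gating signal is one way to emphasize similarity responses while suppressing redundant cross-modal information; the paper's ECA module may realize this differently.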