Accurately assessing the condition of patients with schizophrenia typically requires lengthy and frequent interviews conducted by trained mental health professionals. To reduce the time and manual burden on these professionals, this paper proposes a multimodal assessment model that predicts the severity of each symptom defined in the Scale for the Assessment of Thought, Language, and Communication (TLC) and the Positive and Negative Syndrome Scale (PANSS) from the patient's linguistic, acoustic, and visual behavior. The proposed deep-learning model consists of a multimodal fusion framework and four unimodal transformer-based backbone networks. A second-stage pre-training step is introduced so that each off-the-shelf pre-trained model adapts to the patterns of schizophrenia data more effectively, learning to extract the desired features from the perspective of its own modality. Next, the pre-trained parameters are frozen, and lightweight trainable unimodal modules are inserted and fine-tuned, keeping the number of trainable parameters low while maintaining strong performance. Finally, the four adapted unimodal modules are combined into a single multimodal assessment model through the proposed multimodal fusion framework. For validation, we train and evaluate the proposed model on data from schizophrenia patients recruited at National Taiwan University Hospital; the model achieves an MAE/MSE of 0.534/0.685, outperforming related works in the literature. Experimental results, ablation studies, and comparisons with other multimodal assessment works demonstrate not only the superior performance of our approach but also its effectiveness in extracting and integrating information from multiple modalities.
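The abstract does not include implementation details; the following is a minimal sketch of the general idea of freezing pre-trained backbones, inserting lightweight trainable modules, and fusing four unimodal branches, assuming a PyTorch-style setup. All module names, dimensions, the residual-adapter design, and the concatenation-based fusion head are illustrative assumptions, not the authors' actual architecture.

```python
# Sketch only: frozen backbones + lightweight trainable modules + late fusion.
# Dimensions, module names, and the fusion head are placeholder assumptions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Lightweight trainable bottleneck placed after a frozen backbone."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adapter


class UnimodalBranch(nn.Module):
    """Frozen pre-trained encoder followed by a trainable adapter."""

    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feat = self.backbone(x)
        return self.adapter(feat)


class MultimodalAssessor(nn.Module):
    """Fuses adapted unimodal branches and regresses symptom severity scores."""

    def __init__(self, branches: nn.ModuleDict, dim: int, num_symptoms: int):
        super().__init__()
        self.branches = branches
        self.head = nn.Sequential(
            nn.Linear(dim * len(branches), dim),
            nn.ReLU(),
            nn.Linear(dim, num_symptoms),  # one severity score per symptom item
        )

    def forward(self, inputs: dict) -> torch.Tensor:
        feats = [self.branches[name](x) for name, x in inputs.items()]
        return self.head(torch.cat(feats, dim=-1))


if __name__ == "__main__":
    dim = 128
    # Placeholder encoders stand in for the four pre-trained transformer backbones;
    # branch names and the symptom count are arbitrary for this sketch.
    branches = nn.ModuleDict({
        name: UnimodalBranch(nn.Linear(dim, dim), dim)
        for name in ("m1", "m2", "m3", "m4")
    })
    model = MultimodalAssessor(branches, dim, num_symptoms=30)
    dummy = {name: torch.randn(2, dim) for name in branches}
    print(model(dummy).shape)  # torch.Size([2, 30])
```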