Multimodal neuroimaging is an emerging field that draws on multiple sources of information to diagnose brain disorders, particularly when combined with deep learning. Fusing different brain imaging modalities with deep learning remains a challenging yet crucial research topic. Integrating structural and functional modalities is especially important: structural information plays a central role in diseases such as Alzheimer's, whereas functional imaging is more informative for disorders such as schizophrenia, and combining the two can provide a more comprehensive diagnosis than either alone. In this work, we present MultiViT, a novel diagnostic deep learning model that uses vision transformers and cross-attention mechanisms to fuse 3D gray matter maps derived from structural MRI with functional network connectivity (FNC) matrices obtained from functional MRI via independent component analysis (ICA). MultiViT achieves an AUC of 0.833 for schizophrenia classification, outperforming both our unimodal and multimodal baselines. In addition, by combining the vision transformer's attention maps with the cross-attention mechanism and brain function information, we identify critical brain regions in 3D gray matter space associated with schizophrenia. Our research not only improves the accuracy of AI-based automated imaging diagnostics for schizophrenia, but also introduces a principled data fusion approach: it replaces complex, high-dimensional fMRI data with functional network connectivity, integrates it with representative structural data from 3D gray matter images, and provides interpretable biomarker localization in 3D structural space.
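The cross-attention fusion named above can be illustrated with a minimal single-head sketch in NumPy. This is not the MultiViT implementation: the token counts, the embedding size, the random projection matrices, and the choice of structural tokens as queries with FNC tokens as keys and values are all assumptions made for illustration, and the names `cross_attention`, `smri_tokens`, and `fnc_tokens` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query_tokens:   (n_q, d) tokens that attend (assumed: sMRI patch embeddings)
    context_tokens: (n_c, d) tokens attended to (assumed: FNC embeddings)
    Returns the fused tokens (n_q, d) and the attention weights (n_q, n_c).
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_c) similarity scores
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    fused = weights @ v                      # context information per query token
    return fused, weights


# Hypothetical sizes: 8 gray matter patch tokens, 5 FNC tokens, 16-dim embeddings.
rng = np.random.default_rng(0)
n_smri, n_fnc, d_model = 8, 5, 16
smri_tokens = rng.standard_normal((n_smri, d_model))
fnc_tokens = rng.standard_normal((n_fnc, d_model))

# Random projections stand in for learned weights.
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

fused, weights = cross_attention(smri_tokens, fnc_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch each row of `weights` shows how strongly one structural token draws on each functional token, which is the kind of quantity an attention-based localization of brain regions could build on; how MultiViT actually derives its maps is not specified here.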