Multimodal medical image synthesis plays a crucial role in enhancing diagnostic accuracy and understanding disease progression, particularly in Alzheimer's disease (AD). However, existing methods often focus on single-modality or single-timepoint synthesis, overlooking the complexities of integrating multiple imaging modalities and longitudinal data. Furthermore, these models tend to ignore patient-specific factors such as age, health condition, and sex, limiting their practical applicability in clinical settings. To address these limitations, we propose CrossSim, a novel residual vision transformer-based framework for multimodal and longitudinal medical image synthesis. Our model integrates cross-attention-based feature fusion to incorporate personalized data such as age, health state, and sex, enabling the generation of clinically relevant synthetic images that better represent the complexities of real medical data. In contrast to existing methods, CrossSim synthesizes images that accurately reflect changes over time and across modalities. We conduct extensive experiments on the ADNI dataset to evaluate the effectiveness of our approach. The results show significant improvements in key metrics such as PSNR, SSIM, and RMSE, confirming the superior performance of CrossSim in both qualitative and quantitative analyses. This study highlights the clinical significance of CrossSim, offering a valuable tool for enhancing diagnostic accuracy and advancing our understanding of Alzheimer's disease progression.
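To make the cross-attention-based fusion of patient metadata concrete, the sketch below shows one plausible way image tokens from a vision transformer encoder could attend to embedded covariates (age, health state, sex). This is not the authors' implementation; the abstract does not specify layer sizes, the metadata encoding, or module names, so everything in the snippet (class name, dimensions, the per-covariate token embedding) is an illustrative assumption.

```python
# Minimal sketch (not the authors' code): cross-attention fusion of image
# features with patient covariates, as described at a high level in the
# abstract. All names, sizes, and the metadata encoding are assumptions.
import torch
import torch.nn as nn


class MetadataCrossAttention(nn.Module):
    """Image tokens attend to embedded demographic/clinical covariates."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Each scalar covariate (e.g. age, health state, sex) becomes one token.
        self.covariate_proj = nn.Linear(1, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, covariates):
        # image_tokens: (B, N, dim) patch embeddings from a ViT-style encoder
        # covariates:   (B, C) e.g. [age, health_state, sex]
        meta_tokens = self.covariate_proj(covariates.unsqueeze(-1))  # (B, C, dim)
        fused, _ = self.attn(query=image_tokens, key=meta_tokens, value=meta_tokens)
        return self.norm(image_tokens + fused)  # residual fusion


if __name__ == "__main__":
    block = MetadataCrossAttention()
    imgs = torch.randn(2, 196, 256)            # dummy patch tokens
    meta = torch.tensor([[72.0, 1.0, 0.0],     # age, health state, sex
                         [65.0, 0.0, 1.0]])
    print(block(imgs, meta).shape)             # torch.Size([2, 196, 256])
```

The residual connection around the attention output mirrors the "residual vision transformer" framing in the abstract, but how CrossSim actually combines the fused features with its synthesis decoder is not described here.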