Multimodal data analysis in the medical domain has recently received considerable attention, with researchers from both computer science and medicine developing models to handle multimodal medical data. However, most published work has targeted homogeneous multimodal data. Collecting and preparing heterogeneous multimodal data is a complex and time-consuming task, and developing models to handle such data is a further challenge. This study presents a cross-modal transformer-based fusion approach for multimodal clinical data analysis using medical images and clinical data. The proposed approach uses an image embedding layer to convert images into visual tokens and a clinical embedding layer to convert clinical data into text tokens; a cross-modal transformer module then learns a holistic representation of the imaging and clinical modalities. The approach was evaluated on a multimodal tuberculosis (lung disease) dataset, and the results were compared with recent approaches in multimodal medical data analysis. The comparison shows that the proposed approach outperforms the other approaches considered in this study. It is also faster at analyzing heterogeneous multimodal medical data than the existing methods used in the study, which is important when powerful computing hardware is unavailable.
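To make the tokenize-and-fuse pattern described above concrete, here is a minimal PyTorch sketch of one plausible realization: each modality is projected into a shared token space and a transformer encoder attends over the concatenated token sequence. All dimensions, layer choices (linear projections, `nn.TransformerEncoder`), and the mean-pooled classifier head are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of cross-modal transformer fusion for images + clinical data.
# Layer sizes, names, and the pooling/classification head are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, img_patch_dim=768, clin_feat_dim=32, d_model=256,
                 n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Image embedding layer: projects flattened image patches to visual tokens.
        self.img_embed = nn.Linear(img_patch_dim, d_model)
        # Clinical embedding layer: projects clinical features to tokens.
        self.clin_embed = nn.Linear(clin_feat_dim, d_model)
        # Cross-modal transformer: attends jointly over both token sequences.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, img_patches, clin_feats):
        # img_patches: (batch, n_patches, img_patch_dim)
        # clin_feats:  (batch, n_fields,  clin_feat_dim)
        tokens = torch.cat([self.img_embed(img_patches),
                            self.clin_embed(clin_feats)], dim=1)
        fused = self.fusion(tokens)                # holistic joint representation
        return self.classifier(fused.mean(dim=1)) # pool tokens and classify

# Usage with random tensors standing in for an X-ray and a clinical record.
model = CrossModalFusion()
logits = model(torch.randn(1, 196, 768), torch.randn(1, 10, 32))
print(logits.shape)  # torch.Size([1, 2])
```

Concatenating both token streams before a shared encoder lets attention operate across modalities directly, which is one common way to obtain the kind of holistic imaging-plus-clinical representation the abstract describes.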