Multi-modal image fusion has significantly improved medical diagnosis and object detection in recent years. This study proposes an Adaptive Transformer-Based Multi-Modal Image Fusion (AT-MMIF) framework for real-time medical diagnosis and object detection. The framework uses a Transformer architecture to capture both global and local feature correlations across multiple imaging modalities, including MRI, CT, PET, and X-ray, yielding more accurate diagnostic results and faster object detection in medical imagery. The fusion process combines spatial and frequency-domain information to improve the clarity and detail of the fused images. An adaptive attention mechanism within the Transformer dynamically adjusts to the relevant features of each image type, optimizing fusion in real time. In medical diagnosis, the framework achieves a sensitivity of 98.5% and a specificity of 96.7%. In object detection tasks, it reduces both false positives and false negatives, reaching an F1 score of 97.2%. AT-MMIF is further optimized for real-time processing, with an average inference time of 120 ms per image and a model size 35% smaller than that of existing multi-modal fusion models. By combining Transformer architectures with adaptive learning, the proposed framework offers an efficient and scalable solution for real-time medical diagnosis and object detection in clinical settings such as radiology, oncology, and pathology.
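
For readers who want to relate the reported sensitivity, specificity, and F1 score to raw predictions, the following minimal sketch computes all three from confusion-matrix counts using their standard definitions. It is not taken from the AT-MMIF implementation: the names `ConfusionCounts`, `sensitivity`, `specificity`, and `f1_score` are hypothetical, and the counts in the usage lines are made-up illustration values, not results from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> float:
    # Refuse to divide by zero rather than silently returning a number.
    if denominator == 0:
        raise ValueError("metric is undefined: denominator is zero")
    return numerator / denominator


def sensitivity(c: ConfusionCounts) -> float:
    """Sensitivity (recall): TP / (TP + FN)."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Specificity: TN / (TN + FP)."""
    return _ratio(c.tn, c.tn + c.fp)


def f1_score(c: ConfusionCounts) -> float:
    """F1: harmonic mean of precision and recall, 2TP / (2TP + FP + FN)."""
    return _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)


# Illustrative counts only (assumed, not from the paper).
example = ConfusionCounts(tp=90, fp=5, tn=95, fn=10)
print(f"sensitivity = {sensitivity(example):.3f}")
print(f"specificity = {specificity(example):.3f}")
print(f"F1          = {f1_score(example):.3f}")
```

Because F1 depends only on true positives, false positives, and false negatives, a high F1 score reflects that both error types are low at the same time, which is the sense in which the abstract links it to reduced false positives and negatives.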