BIDFuse: Harnessing Bi-Directional Attention with Modality-Specific Encoders for Infrared-Visible Image Fusion
Infrared-visible image fusion aims to exploit the distinct advantages of each modality to provide a more comprehensive representation than either could offer alone. Current state-of-the-art methods segregate encoded features into modality-specific and modality-independent features. However, this segregation often fails to effectively isolate the feature representations of the different modalities, which can cause information loss and leads to overly complex, ultimately unnecessary designs. To tackle this issue, we propose BIDFuse, a novel two-stage bi-directional fusion network designed to leverage the unique features of both modalities without explicit feature separation. We first use two encoders to extract critical information from the two input images. A cross-feeding mechanism then feeds the features from each encoder directly into the input stream of the other modality's decoder, enhancing the reconstruction process with information from both sources. Finally, a bi-directional attention module is designed to fuse the features from both modalities and generate the fused image. Competitive experimental results demonstrate the effectiveness of our method for image fusion on the MSRS dataset and for low-light object detection on the M3FD dataset.
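The cross-feeding idea lends itself to a compact sketch. The PyTorch snippet below is a minimal illustration, not the authors' code: layer sizes, module names, and the single-scale convolutional encoders are all assumptions; only the wiring (each decoder consumes its own encoder's features concatenated with the other encoder's) follows the abstract.

```python
# Minimal sketch of the cross-feeding idea: each decoder reconstructs its own
# modality but also receives the other encoder's features. All layer sizes and
# names here are illustrative assumptions, not the BIDFuse reference code.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class CrossFedDecoder(nn.Module):
    """Decodes its own features concatenated with the other modality's."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )
    def forward(self, own_feat, other_feat):
        return self.net(torch.cat([own_feat, other_feat], dim=1))

enc_ir, enc_vis = ConvEncoder(), ConvEncoder()
dec_ir, dec_vis = CrossFedDecoder(), CrossFedDecoder()

ir = torch.randn(1, 1, 64, 64)    # dummy infrared image
vis = torch.randn(1, 1, 64, 64)   # dummy visible image
f_ir, f_vis = enc_ir(ir), enc_vis(vis)
rec_ir = dec_ir(f_ir, f_vis)      # IR decoder also sees visible features
rec_vis = dec_vis(f_vis, f_ir)    # and vice versa
```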
- Research Article
3
- 10.1117/1.jei.29.1.013014
- Feb 4, 2020
- Journal of Electronic Imaging
Image fusion obtains a desired image by integrating the useful information of multiple input images. Most traditional fusion strategies are guided by image local contrast or variance, which cannot adequately represent the visually discernible features of the source images. Moreover, the undesirable seam effects or artifacts produced by inconsistency between the fusion weight map and the image content may severely degrade the visual quality of the fused images. An efficient image fusion method with a structural saliency measure and content-adaptive consistency verification is proposed. The fusion is implemented within the nonsubsampled contourlet transform (NSCT)-based image fusion framework. The low-frequency NSCT decomposition coefficients are fused with a weight map constructed by considering both structural saliency and visual uniqueness features and refined for spatial consistency with a guided filter. The high-frequency NSCT decomposition coefficients are fused by structural saliency. The performance of the proposed method has been verified on several pairs of multi-focus, infrared-visible, and multimodal medical images. Experimental results clearly demonstrate the superiority of the proposed algorithm over several existing state-of-the-art algorithms in both visual and quantitative comparisons.
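NSCT implementations are not commonly available in Python, so the sketch below substitutes a standard DWT (PyWavelets) purely to illustrate the two fusion rules described above: a local-variance weight map, smoothed for spatial consistency (a crude stand-in for guided-filter refinement), for the low-frequency band, and magnitude-based choose-max for the high-frequency bands. Window sizes and the wavelet are assumptions.

```python
# Illustration of the fusion rules only: PyWavelets' DWT stands in for the
# NSCT, a local-variance weight map (smoothed for spatial consistency) fuses
# the low-frequency band, and absolute-value choose-max fuses the details.
import numpy as np
import pywt
from scipy.ndimage import uniform_filter

def local_variance(x, size=7):
    mean = uniform_filter(x, size)
    return np.maximum(uniform_filter(x * x, size) - mean * mean, 0.0)

def fuse_dwt(a, b, wavelet="db2", level=2):
    ca = pywt.wavedec2(a, wavelet, level=level)
    cb = pywt.wavedec2(b, wavelet, level=level)
    # Low-frequency band: weight by local variance, then smooth the weight
    # map (a crude stand-in for guided-filter consistency refinement).
    va, vb = local_variance(ca[0]), local_variance(cb[0])
    w = uniform_filter(va / (va + vb + 1e-12), 5)
    fused = [w * ca[0] + (1 - w) * cb[0]]
    # High-frequency bands: choose the coefficient with larger magnitude.
    for da, db in zip(ca[1:], cb[1:]):
        fused.append(tuple(np.where(np.abs(x) >= np.abs(y), x, y)
                           for x, y in zip(da, db)))
    return pywt.waverec2(fused, wavelet)

ir = np.random.rand(128, 128)
vis = np.random.rand(128, 128)
out = fuse_dwt(ir, vis)
```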
- Research Article
56
- 10.1016/j.optlaseng.2022.107078
- Apr 21, 2022
- Optics and Lasers in Engineering
Multimodal super-resolution reconstruction of infrared and visible images via deep learning
- Research Article
1
- 10.1504/ijspm.2019.10027884
- Jan 1, 2019
- International Journal of Simulation and Process Modelling
With the development of smart cities, informative images play an increasingly important role in recognition, detection, and perception, and image fusion is an efficient technique for integrating information from multiple images. Multi-scale transform (MST) and sparse representation (SR) are widely used in infrared-visible image fusion. Traditional MST-based fusion methods struggle to represent all features of the source images, while traditional SR-based fusion methods do not consider the morphological information of image features during dictionary learning. To overcome the defects of both, this paper presents an infrared-visible image fusion framework combining the dual-tree complex wavelet transform (DT-CWT) and SR. The source images are decomposed by DT-CWT into high- and low-pass bands. The high-pass bands are fused by the sum-modified-Laplacian (SML), and the low-pass bands are fused by an SR-based approach. The fused high- and low-pass bands are then reconstructed by the inverse DT-CWT to form the final fused image. Compared with five mainstream image fusion solutions, the proposed framework achieves state-of-the-art performance in infrared-visible image fusion.
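The sum-modified-Laplacian used for the high-pass bands has a standard form that is easy to state in code. The NumPy sketch below shows one common variant; the window size and the unit step are assumptions, and the choose-max rule at the end shows how SML typically drives the high-pass fusion.

```python
# A common form of the sum-modified-Laplacian (SML) activity measure used to
# fuse high-pass bands; the window size and unit step are assumptions.
import numpy as np
from scipy.ndimage import uniform_filter

def sml(x, window=5):
    p = np.pad(x, 1, mode="reflect")
    ml = (np.abs(2 * p[1:-1, 1:-1] - p[:-2, 1:-1] - p[2:, 1:-1]) +
          np.abs(2 * p[1:-1, 1:-1] - p[1:-1, :-2] - p[1:-1, 2:]))
    return uniform_filter(ml, window)  # windowed sum, up to a constant factor

# Choose-max rule on two high-pass bands with SML as the activity measure.
def fuse_highpass(ha, hb):
    return np.where(sml(ha) >= sml(hb), ha, hb)
```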
- Research Article
14
- 10.1109/tpami.2025.3591930
- Nov 1, 2025
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Image fusion aims to merge image pairs collected by different sensors over the same scene while preserving their distinct features. Recent works have focused on designing various image fusion losses, developing different network architectures, and leveraging downstream tasks (e.g., object detection) for image fusion. However, few studies have explored how language and semantic masks can guide image fusion. In this paper, we investigate how the combination of language and masks can guide image fusion tasks, discarding previously complex frameworks that rely on downstream tasks, GAN-based cycle training, diffusion models, or deep image priors. Additionally, we exploit a recurrent-neural-network-like architecture to build a lightweight network that avoids the quadratic cost of traditional attention mechanisms. To adapt the receptance weighted key value (RWKV) model to the image modality, we modify it into a bidirectional version using an efficient scanning strategy (ESS). To guide image fusion by language and mask features, we introduce a multi-modal fusion module (MFM) to facilitate information exchange. Comprehensive experiments show that the proposed framework achieves state-of-the-art results on various image fusion tasks (i.e., visible-infrared, multi-focus, multi-exposure, medical, and hyperspectral-multispectral image fusion, as well as pansharpening).
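Since RWKV layers are not part of stock deep learning toolkits, the sketch below uses a bidirectional GRU as a stand-in just to make the linear-cost bidirectional-scan idea concrete: the 2-D feature map is flattened into a sequence, scanned in both directions, and the two passes are merged. The row-major flattening is a simplification of the paper's efficient scanning strategy.

```python
# Sketch of a bidirectional scan over a 2-D feature map with a linear-cost
# recurrent layer. A stock GRU stands in for the RWKV block, and the simple
# row-major flattening stands in for the paper's efficient scanning strategy.
import torch
import torch.nn as nn

class BiScan2d(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)     # (B, H*W, C) row-major scan
        out, _ = self.rnn(seq)                 # forward + backward passes
        out = self.merge(out)                  # (B, H*W, C)
        return out.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 32, 16, 16)
print(BiScan2d(32)(feat).shape)  # torch.Size([2, 32, 16, 16])
```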
- Research Article
3
- 10.1371/journal.pone.0277862
- Dec 15, 2022
- PLOS ONE
High-resolution magnetic resonance (MR) imaging has attracted much attention due to its contribution to clinical diagnosis and treatment. However, because of noise interference and the limitations of imaging equipment, it is expensive to generate a satisfactory image. Super-resolution (SR) is a technique that enhances an imaging system's resolution and is effective and cost-efficient for MR imaging. In recent years, deep learning-based SR methods have made remarkable progress on natural images but not on medical images. Most existing medical image SR algorithms focus on the spatial information of a single image but ignore the temporal correlation within medical image sequences. We propose two novel architectures, for single medical images and for sequential medical images, respectively. The multi-scale back-projection network (MSBPN) is constructed from several back-projection units of different scales, each consisting of iterative up- and down-sampling layers. This multi-scale machinery extracts spatial information at different scales and strengthens information fusion for a single image. Based on MSBPN, we propose an accurate and lightweight multi-scale bidirectional fusion attention network (MSBFAN) that combines temporal information iteratively, extracting supplementary temporal information from the image sequence adjacent to the target image. The MSBFAN can effectively learn both spatio-temporal dependencies and the iterative refinement process with only a lightweight parameter count. Experimental results demonstrate that MSBPN and MSBFAN outperform current SR methods in terms of reconstruction accuracy and model parameter count.
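A back-projection unit of the kind the abstract iterates can be written compactly. The PyTorch sketch below follows the familiar DBPN-style up-projection recipe (up-sample, project back down, correct with the low-resolution residual); the channel count, kernel size, and scale factor are assumptions.

```python
# A single DBPN-style up-projection unit, the building block that a
# back-projection network iterates; sizes and scale are assumptions.
import torch
import torch.nn as nn

class UpProjection(nn.Module):
    def __init__(self, ch=32, scale=2):
        super().__init__()
        k, s, p = 2 * scale, scale, scale // 2
        self.up1 = nn.ConvTranspose2d(ch, ch, k, s, p)
        self.down = nn.Conv2d(ch, ch, k, s, p)
        self.up2 = nn.ConvTranspose2d(ch, ch, k, s, p)

    def forward(self, lr):
        hr0 = self.up1(lr)          # initial up-sampled estimate
        lr0 = self.down(hr0)        # project back to low resolution
        res = lr0 - lr              # low-resolution reconstruction error
        return hr0 + self.up2(res)  # correct the estimate with the error

x = torch.randn(1, 32, 24, 24)
print(UpProjection()(x).shape)  # torch.Size([1, 32, 48, 48])
```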
- Research Article
1
- 10.1088/1402-4896/ad2328
- Feb 8, 2024
- Physica Scripta
In tactical warfare settings, target detection is profoundly compromised by prevalent environmental factors such as smoke, dust, and atmospheric interference. Such impediments undermine the precision and reliability of identifying pivotal targets, with potentially dire consequences. Short-wave infrared technology has proven notably effective at revealing target attributes even under challenging conditions characterized by smoke, fog, or haze. Against this backdrop, the present study describes an algorithmic framework that integrates image registration and fusion, realized through a dual-discriminator generative adversarial network (GAN) tailored for fusing short-wave infrared and visible-light imagery in smoke-obscured contexts. Our methodology begins with an augmented Speeded-Up Robust Features (SURF) algorithm designed to rectify inherent misalignments in the input imagery. Subsequent enhancements include refining the generator's loss function and integrating a multi-scale convolutional kernel, facilitating the extraction and fusion of a broader array of salient features and thereby raising fusion quality. To corroborate the efficacy and robustness of the proposed framework, rigorous validation was conducted on a curated dataset of short-wave infrared and visible-light images. Empirical evaluations, encompassing both subjective and objective comparative analyses, affirm the superior performance of our fusion network: it surpasses alternative fusion techniques in visual fidelity, perceptual quality, and the structural congruence of the synthesized images.
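The registration front end can be illustrated with OpenCV. SURF itself is patent-encumbered and requires a nonfree opencv-contrib build, so the sketch below uses ORB as a stand-in; the match-then-homography steps are the standard OpenCV recipe, not the paper's augmented SURF.

```python
# Feature-based pre-registration as in the pipeline above. ORB replaces the
# patented SURF detector (SURF needs opencv-contrib's nonfree build); the
# matching and homography steps are the standard OpenCV recipe.
import cv2
import numpy as np

def register(moving, fixed, max_feats=1000):
    orb = cv2.ORB_create(max_feats)
    kp1, des1 = orb.detectAndCompute(moving, None)
    kp2, des2 = orb.detectAndCompute(fixed, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = fixed.shape[:2]
    return cv2.warpPerspective(moving, H, (w, h))
```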
- Research Article
99
- 10.1109/access.2017.2685178
- Jan 1, 2017
- IEEE Access
Multi-scale image fusion is one of the main fusion approaches, in which the multi-scale decomposition tool and feature extraction play very important roles. The quaternion wavelet transform (QWT) is an effective multi-scale decomposition tool. This paper therefore proposes a novel multimodal image fusion method using the QWT and multiple features. First, we perform the QWT on each source image to obtain low- and high-frequency coefficients. Second, a weighted-average fusion rule based on the phase and magnitude of the low-frequency subband and on spatial variance is proposed to fuse the low-frequency subbands. Next, a choose-max fusion rule based on coefficient contrast and energy is proposed to integrate the high-frequency subbands. Finally, the fused image is constructed by the inverse QWT. The proposed method is evaluated on multi-focus, medical, infrared-visible, and remote sensing images. Experimental results demonstrate its effectiveness.
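The high-frequency choose-max rule is simple to sketch. The NumPy snippet below uses local energy and a simple local-contrast proxy as the activity measure; the window size and the exact contrast definition are assumptions.

```python
# Sketch of the high-frequency rule: per coefficient, pick the subband whose
# local contrast x energy activity is larger. Window size and the contrast
# definition are illustrative assumptions.
import numpy as np
from scipy.ndimage import uniform_filter

def activity(c, size=5):
    energy = uniform_filter(c * c, size)             # local energy
    mean = np.abs(uniform_filter(c, size)) + 1e-12
    contrast = np.abs(c) / mean                      # simple local contrast
    return contrast * energy

def choose_max(ca, cb):
    return np.where(activity(ca) >= activity(cb), ca, cb)
```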
- Research Article
8
- 10.1504/ijspm.2018.10012814
- Jan 1, 2018
- International Journal of Simulation and Process Modelling
Image fusion technology is widely used in different areas: it integrates the complementary and relevant information of source images captured by multiple sensors into a single synthetic image, and it plays an increasingly important role in smart cities. The quality of the fused image affects the accuracy, efficiency, and robustness of the related applications. Existing sparse representation-based image fusion methods consist of overcomplete, redundant dictionary learning and sparse coding. However, an overcomplete, redundant dictionary does not account for discriminative ability, which may seriously degrade image fusion, and a good dictionary is key to a successful fusion technique. To construct a discriminative dictionary, a novel framework integrating image-patch clustering with online dictionary learning is proposed for visible-infrared image fusion. Comparison experiments with existing solutions validate and demonstrate the effectiveness of the proposed approach.
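The cluster-then-learn construction can be sketched with scikit-learn: patches are clustered with k-means and a dictionary is learned per cluster. MiniBatchDictionaryLearning approximates the online dictionary learning the abstract mentions, and the patch size, cluster count, and dictionary size below are assumptions.

```python
# Sketch of patch clustering followed by per-cluster dictionary learning.
# MiniBatchDictionaryLearning approximates online dictionary learning;
# patch size, cluster count, and dictionary size are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

img = np.random.rand(128, 128)                  # dummy source image
patches = extract_patches_2d(img, (8, 8), max_patches=2000)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=1, keepdims=True)              # remove DC per patch

labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
dictionaries = [
    MiniBatchDictionaryLearning(n_components=32, batch_size=64)
    .fit(X[labels == k])
    for k in range(4)                           # one dictionary per cluster
]
```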
- Research Article
107
- 10.1016/j.inffus.2024.102450
- May 3, 2024
- Information Fusion
Diff-IF: Multi-modality image fusion via diffusion model with fusion knowledge prior
- Conference Article
- 10.1117/12.2325391
- Oct 8, 2018
Fusion of thermal infrared and visual images is an important technique for real-time surveillance applications. Since image fusion is used in many real-time night-vision applications such as target detection, recognition, and tracking, it is important to understand the processing requirements and provide computationally efficient methods. In this paper, we present a real-time image fusion system designed for night-vision supervisory and monitoring purposes. The system is equipped with two image sensors, TV and IR (thermal infrared), and the processing pipeline is implemented on the NVIDIA Tegra TX2, a platform with a many-core NVIDIA GPU and a multi-core ARM CPU. We present the system architecture as well as the design of an efficient, real-time multi-spectral signal-processing algorithm based on second-generation wavelets, also called the lifting scheme. We also show a novel parallelization approach that performs the calculations in place, so no auxiliary memory is needed, enabling a fast, parallel, pipelined processing flow. We achieve a considerable speedup over an optimized CPU implementation, and the experimental results show that the system can fuse dual-channel Full-HD images in real time at 30 frames per second.
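The in-place property of the lifting scheme is what enables the no-auxiliary-memory parallelization, and it is easy to demonstrate. The NumPy sketch below performs one level of Haar lifting entirely in place; the paper's actual filters are not specified here, so Haar is an assumption chosen for brevity.

```python
# One level of an in-place lifting-scheme (second-generation) wavelet: the
# predict/update steps overwrite the signal, so no auxiliary buffer is needed.
import numpy as np

def haar_lift_inplace(x):
    """x has even length; afterwards x[0::2] holds the approximation
    coefficients and x[1::2] holds the details."""
    x[1::2] -= x[0::2]          # predict: detail = odd - even
    x[0::2] += x[1::2] / 2      # update:  approx = even + detail / 2
    return x

def haar_unlift_inplace(x):
    x[0::2] -= x[1::2] / 2      # invert the update step
    x[1::2] += x[0::2]          # invert the predict step
    return x

sig = np.arange(8, dtype=float)
rec = haar_unlift_inplace(haar_lift_inplace(sig.copy()))
assert np.allclose(rec, sig)    # perfect reconstruction
```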
- Research Article
46
- 10.1109/tmm.2020.2978640
- Mar 6, 2020
- IEEE Transactions on Multimedia
Visible and near-infrared image fusion aims to exploit the spectral characteristics of the two modalities to enhance visibility. However, current visible and near-infrared fusion algorithms do not preserve spectral characteristics well, which results in color distortion and halo artifacts. This paper therefore proposes a new visible and near-infrared image fusion algorithm that fully considers their different reflection and scattering characteristics. Based on an image degradation model, a reflection weight model and a transmission weight model are established. The reflection weight model is built by calculating the difference between the visible (red, green, and blue) spectra and the near-infrared spectrum while maintaining the correlation of the visible spectra; it preserves the original reflection characteristics of objects in natural scenes. The transmission weight model is obtained by calculating the gradient ratio of the visible spectra to the near-infrared spectrum; it exploits the strong transmission of the near-infrared spectrum to compensate for the detail loss in the visible spectra caused by light scattering. Moreover, the image fused with the two models is further enhanced according to the reflection characteristics of the near-infrared spectrum in the case of non-uniform illumination. Experimental results demonstrate that the proposed algorithm preserves spectral characteristics well and avoids color distortion while maintaining naturalness, outperforming the state of the art.
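The gradient-ratio idea behind the transmission weight can be made concrete in a few lines. The sketch below is an assumption-laden simplification: a single-channel visible luminance stands in for the RGB spectra, and the ratio-to-weight mapping and Gaussian smoothing are illustrative choices, not the paper's model.

```python
# Sketch of the gradient-ratio idea behind a transmission weight: where the
# NIR channel carries stronger gradients than the visible one, favor NIR
# detail. The blend form and smoothing are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def grad_mag(x):
    gy, gx = np.gradient(x)
    return np.hypot(gx, gy)

def transmission_weight(vis_gray, nir, eps=1e-6):
    ratio = grad_mag(nir) / (grad_mag(vis_gray) + eps)
    w = ratio / (1.0 + ratio)              # map ratio into [0, 1)
    return gaussian_filter(w, sigma=3)     # spatially smooth the weight

vis = np.random.rand(64, 64)               # dummy visible luminance
nir = np.random.rand(64, 64)               # dummy near-infrared channel
w = transmission_weight(vis, nir)
fused_luma = (1 - w) * vis + w * nir
```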
- Research Article
25
- 10.1016/j.inffus.2024.102492
- May 24, 2024
- Information Fusion
RGBT tracking: A comprehensive review
- Research Article
89
- 10.1016/j.patcog.2010.08.006
- Aug 12, 2010
- Pattern Recognition
Particle swarm optimization based fusion of near infrared and visible images for improved face verification
- Conference Article
17
- 10.1109/icoase.2018.8548898
- Oct 1, 2018
Image enhancement via image fusion is important for many applications. Image fusion is performed in either the spatial domain or a transform domain. Weighted-averaging-based fusion operates in the spatial domain, but the resulting fused image is low-contrast and blurry; transform-domain fusion generally yields better quality. Standard discrete wavelet transform (DWT)-based fusion operates in the transform domain and produces better results than weighted averaging, yet the DWT-fused image still lacks spatial detail. This paper proposes a simple fusion scheme that enhances the quality of weighted-averaging or DWT-based fusion for visible-infrared images. Because infrared images are inherently blurred, the input infrared image is convolved with the proposed sharpening filter, yielding an image with highlighted edges and fine details. Weighted averaging or DWT-based fusion is then applied to the visible image and the enhanced infrared image to obtain the final result. Experimental results show that the proposed method produces better results than the standard fusion methods.
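Both steps fit in a short sketch: convolve the infrared image with a sharpening kernel, then fuse with a one-level DWT (average the approximation coefficients, choose-max the details). The 3x3 kernel, the Haar wavelet, and the fusion rules are common defaults assumed here, not necessarily the paper's.

```python
# Sharpen-then-fuse in miniature: a Laplacian-style kernel sharpens the IR
# image, then a one-level DWT fuses it with the visible image (average the
# approximations, choose-max the details). Kernel and wavelet are assumptions.
import numpy as np
import pywt
from scipy.signal import convolve2d

SHARPEN = np.array([[0, -1, 0],
                    [-1, 5, -1],
                    [0, -1, 0]], dtype=float)

def fuse(vis, ir):
    ir_sharp = convolve2d(ir, SHARPEN, mode="same", boundary="symm")
    (ca_v, dv), (ca_i, di) = pywt.dwt2(vis, "haar"), pywt.dwt2(ir_sharp, "haar")
    ca = 0.5 * (ca_v + ca_i)                         # average approximations
    details = tuple(np.where(np.abs(a) >= np.abs(b), a, b)
                    for a, b in zip(dv, di))         # choose-max details
    return pywt.idwt2((ca, details), "haar")

vis = np.random.rand(64, 64)
ir = np.random.rand(64, 64)
out = fuse(vis, ir)
```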
- Research Article
- 10.3390/s25237118
- Nov 21, 2025
- Sensors
As urbanization accelerates, façade defects in existing residential buildings have become increasingly prominent, posing serious threats to structural safety and residents' quality of life. In the high-density built environment of Shenzhen, traditional manual inspection is inefficient and prone to omission errors. This study proposes an integrated framework for façade defect detection that combines unmanned aerial vehicle (UAV)-based visible-light and thermal infrared imaging with deep learning algorithms and parametric three-dimensional (3D) visualization. Three representative residential communities constructed between 1988 and 2010 in Shenzhen were selected as case studies. The main findings are as follows: (1) the fusion of visible and thermal infrared images enables the synergistic identification of cracks and moisture-intrusion defects; (2) shooting distance significantly affects mapping efficiency and accuracy: for low-rise buildings, close-range imaging at 5-10 m ensures high mapping precision, whereas for high-rise structures, medium-range imaging at approximately 20-25 m best balances detection efficiency, accuracy, and dual-defect recognition; (3) the developed Grasshopper-integrated mapping tool enables real-time 3D visualization and parametric analysis of defect information. The Knet-based model achieves an mIoU of 87.86% for crack detection and 79.05% for leakage detection. This UAV-based automated inspection framework is particularly suitable for densely populated urban districts and large residential areas, providing an efficient technical solution for city-wide building safety management. It also lays a solid foundation for automated building maintenance systems and facilitates their integration into future smart-city infrastructures.
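For reference, an mIoU figure like the 87.86% quoted above is the per-class intersection-over-union averaged across classes. The NumPy sketch below computes it from dummy binary masks; the class layout is an assumption.

```python
# Mean intersection-over-union (mIoU) over the classes present in either the
# prediction or the ground truth; masks here are random dummies.
import numpy as np

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 2, (64, 64))    # dummy crack/background mask
target = np.random.randint(0, 2, (64, 64))
print(f"mIoU: {mean_iou(pred, target, 2):.4f}")
```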