Dual Attention Enhanced Transformer for Image Defocus Deblurring
Image defocus deblurring remains a challenging problem due to the uncertainty of the blurred region and the varying depth of field. Although convolutional neural networks (CNNs) have achieved promising results on this task, their limited receptive field and static weights hinder restoration performance. In contrast, Transformer models can mitigate these weaknesses of CNNs. However, recent Transformer-based models for image defocus deblurring utilize self-attention along either the spatial or the channel dimension only, neglecting cross-dimensional information essential for restoration. In this paper, we propose a novel Transformer model, the Dual Attention Enhanced Transformer (DAEformer), for image defocus deblurring. DAEformer combines self-attention from both the spatial and channel dimensions while applying auxiliary enhanced-attention modules. We present the Spatial Attention Enhanced Block (SAEB) and the Channel Attention Enhanced Block (CAEB), which not only fuse spatial and channel information within blocks but also enhance details. Furthermore, we design a progressive hierarchical architecture that applies SAEB/CAEB at different levels to model distinct information and facilitate fusion across blocks. Experimental results demonstrate that DAEformer achieves state-of-the-art results on the dual-pixel dataset.
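A minimal sketch of the dual spatial/channel attention idea described above, assuming a Restormer-style transposed attention for the channel branch; the module layout and names are illustrative, not the authors' DAEformer code:

```python
import torch
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Toy block combining spatial self-attention (over pixel tokens) with
    channel self-attention (similarity computed between channels)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.qkv = nn.Linear(dim, dim * 3)            # for the channel branch
        self.temperature = nn.Parameter(torch.ones(1))

    def channel_attn(self, x):                        # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # transpose so attention runs across channels instead of pixels
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        q = nn.functional.normalize(q, dim=-1)
        k = nn.functional.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # (B, C, C)
        return (attn.softmax(dim=-1) @ v).transpose(1, 2)

    def forward(self, x):               # x: (B, N, C), tokens of an HxW map
        s = self.norm1(x)
        x = x + self.spatial_attn(s, s, s, need_weights=False)[0]
        return x + self.channel_attn(self.norm2(x))

tokens = torch.randn(2, 32 * 32, 48)                # 32x32 map, 48 channels
print(DualAttentionBlock(48)(tokens).shape)         # torch.Size([2, 1024, 48])
```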
- Research Article
- 10.3390/app15063173
- Mar 14, 2025
- Applied Sciences
Defocus deblurring is a challenging task in the fields of computer vision and image processing. The irregularity of defocus blur kernels, coupled with the limitations of computational resources, poses significant difficulties for defocused image restoration. Additionally, the varying degrees of blur across different regions of the image impose higher demands on feature capture. Insufficient fine-grained feature extraction can result in artifacts and the loss of details, while inadequate coarse-grained feature extraction can cause image distortion and unnatural transitions. To address these challenges, we propose a defocus image deblurring method based on a hybrid CNN–Mamba architecture. This approach employs a data-driven, network-based self-learning strategy for blur processing, eliminating the need for traditional blur kernel estimation. Furthermore, through parallel feature extraction modules, the method leverages the local feature extraction capabilities of CNNs to capture image details, effectively restoring texture and edge information, while the Mamba module models long-range dependencies to ensure global consistency. On the real defocus blur dual-pixel image dataset DPDD, the proposed CMDDNet achieves a PSNR of 28.37 dB on the Indoor subset, surpassing Uformer-B (28.23 dB) while reducing the parameter count to only 9.74 M, 80.9% less than Uformer-B (50.88 M). Although its PSNR on the Outdoor and Combined subsets is slightly lower, CMDDNet maintains competitive performance with a significantly reduced model size, demonstrating its efficiency and effectiveness in defocus deblurring. These results indicate that CMDDNet offers a promising trade-off between performance and computational efficiency, making it well suited for lightweight applications.
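The parallel local/global design can be sketched roughly as below, with a plain CNN branch and a much-simplified linear state-space scan standing in for the Mamba branch (the real selective-scan mechanism is considerably more involved); all names here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Toy linear state-space scan: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.9))
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                # x: (B, L, C), scanned over L
        h = torch.zeros(x.shape[0], x.shape[2], device=x.device)
        ys = []
        for t in range(x.shape[1]):
            h = self.a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)

class ParallelBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Sequential(      # CNN branch: texture and edges
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1))
        self.global_ = SimpleSSM(dim)    # scan branch: long-range dependency
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):                # x: (B, C, H, W)
        b, c, h, w = x.shape
        g = self.global_(x.flatten(2).transpose(1, 2))       # (B, H*W, C)
        g = g.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([self.local(x), g], dim=1))

print(ParallelBlock(16)(torch.randn(1, 16, 32, 32)).shape)   # (1, 16, 32, 32)
```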
- Research Article
- 10.1038/s41598-025-07326-6
- Jul 2, 2025
- Scientific Reports
Defocus blur commonly arises from a camera's depth-of-field limitations. While deep learning shows promise for image restoration problems, defocus deblurring requires accurate training data comprising pairs of all-in-focus and defocused images, which can be difficult to collect in real-world scenarios. To address this problem, we propose a high-resolution iterative deblurring method for real scenes driven by a score-based diffusion model. The method trains a score network by learning the score function of focused images at different noise levels and reconstructs high-quality images through a reverse-time stochastic differential equation (SDE). A predictor-corrector (PC) framework corrects discretization errors in the reverse-time SDE to enhance the robustness of images during reconstruction. The iterative nature of diffusion models enables a gradual improvement in image quality, progressively enhancing details and refining the marginal distribution with each iteration, so that the distribution of generated images increasingly approximates that of sharply focused images. Unlike mainstream end-to-end approaches, this method does not require paired all-in-focus and defocused images to train the model. Real-world datasets, including self-captured data, were used for model training, and additional testing was conducted on the RealBlur and DED datasets to evaluate its efficacy. Compared with DnCNN, FFDNet, and CycleGAN, the proposed method achieved superior performance on real-world datasets, including self-captured scenes, with improvements of approximately 13.4% in PSNR and 34.7% in SSIM. These results indicate that a significant enhancement in the clarity of defocused images can be attained, effectively enabling high-resolution iterative defocus deblurring in real-world scenarios through the diffusion model.
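A minimal predictor-corrector sampler for a reverse-time VE-SDE, in the spirit of the method above; `score_fn` stands in for the trained score network, and the noise schedule, SNR, and step sizes are illustrative assumptions:

```python
import torch

def pc_sample(score_fn, shape, sigmas, snr=0.16, device="cpu"):
    x = torch.randn(shape, device=device) * sigmas[0]
    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        # Corrector: one Langevin MCMC step at the current noise level
        grad = score_fn(x, s_cur)
        noise = torch.randn_like(x)
        step = 2 * (snr * noise.norm() / grad.norm()) ** 2
        x = x + step * grad + torch.sqrt(2 * step) * noise
        # Predictor: Euler-Maruyama step of the reverse-time SDE
        g2 = s_cur ** 2 - s_next ** 2          # discretized diffusion term
        x = x + g2 * score_fn(x, s_cur) + torch.sqrt(g2) * torch.randn_like(x)
    return x

# Toy usage with a dummy score function pulling samples toward zero.
sigmas = torch.linspace(10.0, 0.01, 100)
out = pc_sample(lambda x, s: -x / s ** 2, (1, 3, 32, 32), sigmas)
print(out.shape)
```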
- Research Article
- 5
- 10.1609/aaai.v37i3.25446
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Recent research has shown that dual-pixel sensors enable great progress in defocus map estimation and image defocus deblurring. However, extracting dual-pixel views in real time is troublesome and complicates algorithm deployment. Moreover, deblurred images generated by defocus deblurring networks lack high-frequency details, which is unsatisfying to human perception. To overcome these issues, we propose a novel defocus deblurring method that uses the guidance of the defocus map to perform image deblurring. The proposed method consists of a learnable blur kernel that estimates the defocus map in an unsupervised manner and, for the first time, a single-image defocus deblurring generative adversarial network (DefocusGAN). The network can learn the deblurring of different regions and recover realistic details, and we propose a defocus adversarial loss to guide the training process. Competitive experimental results confirm that, with a learnable blur kernel, the generated defocus map achieves results comparable to supervised methods. In the single-image defocus deblurring task, the proposed method achieves state-of-the-art results, with especially significant improvements in perceptual quality: PSNR reaches 25.56 dB and LPIPS reaches 0.111.
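One way to make the defocus map guide training, sketched under the assumption that blurrier regions should be weighted more heavily in the loss; the weighting scheme below is illustrative, not the paper's exact defocus adversarial loss:

```python
import torch

def map_weighted_l1(pred, target, defocus_map, eps=1e-6):
    """pred/target: (B,3,H,W); defocus_map: (B,1,H,W), larger = blurrier."""
    # normalize so the weights average to ~1 per image (assumed scheme)
    w = defocus_map / (defocus_map.mean(dim=(2, 3), keepdim=True) + eps)
    return (w * (pred - target).abs()).mean()

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
dmap = torch.rand(2, 1, 64, 64)          # estimated by the learnable kernel
print(map_weighted_l1(pred, target, dmap).item())
```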
- Conference Article
- 82
- 10.1109/cvpr52688.2022.01582
- Jun 1, 2022
Defocus deblurring is a challenging task due to the spatially varying nature of defocus blur. While the deep learning approach shows great promise in solving image restoration problems, defocus deblurring demands accurate training data consisting of all-in-focus and defocus image pairs, which are difficult to collect. Naive two-shot capture cannot achieve pixel-wise correspondence between the defocused and all-in-focus image pairs. The synthetic aperture of light fields has been suggested as a more reliable way to generate accurate image pairs. However, the defocus blur generated from light field data differs from that of images captured with a traditional digital camera. In this paper, we propose a novel deep defocus deblurring network that leverages the strengths and overcomes the shortcomings of light fields. We first train the network on a light-field-generated dataset for its highly accurate image correspondence. Then, we fine-tune the network using a feature loss on another dataset collected by the two-shot method to alleviate the differences between the defocus blur in the two domains. This strategy proves to be highly effective, achieving state-of-the-art performance both quantitatively and qualitatively on multiple test sets. Extensive ablation studies analyze the effect of each network module on the final performance.
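The fine-tuning stage's feature loss is presumably a perceptual loss of the usual kind; a sketch using VGG-16 features (the layer choice and the omission of ImageNet normalization are simplifying assumptions):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FeatureLoss(nn.Module):
    """L1 distance between VGG-16 features (up to relu3_3) of two images."""
    def __init__(self, layer=16):
        super().__init__()
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layer].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, pred, target):
        # inputs assumed in [0, 1]; ImageNet normalization omitted for brevity
        return nn.functional.l1_loss(self.features(pred), self.features(target))

loss_fn = FeatureLoss()
print(loss_fn(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)).item())
```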
- Conference Article
- 10.1109/cisp-bmei.2016.7852704
- Oct 1, 2016
To reconstruct an all-in-focus image from a conventional camera, a spatially varying defocus deblurring approach based on a blur map and TV/L2 regularization was proposed. First, the lens defocus model was analyzed, and the principles, characteristics, and applicability of the disk and Gaussian defocus models were summarized. Second, the local contrast prior was modified using edge properties and combined with the gradient of blurry edges to obtain a blur map; to reduce noise and ambiguous edges in the blur map, guided filtering was also adopted to refine it. Finally, TV/L2 regularization, solved by an augmented Lagrangian method, was employed to deblur the defocused image, and the all-in-focus image was obtained using scale selection and image reconstruction. Experimental results on both synthesized and real images showed that the proposed approach produces excellent all-in-focus images with better visual quality, outperforming state-of-the-art spatially invariant and spatially varying defocus deblurring methods.
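For reference, a standard TV/L2 deblurring model and its augmented Lagrangian, with notation reconstructed for illustration (the paper's exact formulation may differ); here $g$ is the blurred image, $H$ the blur operator, $f$ the latent sharp image, and $u$ an auxiliary variable split from $\nabla f$:

```latex
\min_{f}\; \frac{\mu}{2}\,\|Hf - g\|_2^2 + \|\nabla f\|_1
\qquad\Longrightarrow\qquad
\mathcal{L}_\rho(f, u, \lambda) =
\frac{\mu}{2}\,\|Hf - g\|_2^2 + \|u\|_1
+ \lambda^{\top}(\nabla f - u) + \frac{\rho}{2}\,\|\nabla f - u\|_2^2 ,
```

which is minimized by alternating over $f$ (a quadratic subproblem solvable with FFTs), $u$ (soft-thresholding), and a dual ascent step on $\lambda$.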
- Research Article
- 2
- 10.1609/aaai.v39i7.32819
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
We propose Intra and Inter Parser-Prompted Transformers (PPTformer), which explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features that facilitate restoration. IntraPPA reconsiders cross-attention between parser and restoration features, enabling implicit perception of the parser from a long-range, intra-layer perspective. Conversely, InterPPA first fuses restoration features with those of the parser, then formulates these fused features within an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network to guide restoration through pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.
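A sketch of the cross-attention direction described for IntraPPA, where restoration features form the queries and parser features supply keys and values; dimensions and names are illustrative, not the published implementation:

```python
import torch
import torch.nn as nn

class ParserPromptedAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_r = nn.LayerNorm(dim)
        self.norm_p = nn.LayerNorm(dim)

    def forward(self, restore_feat, parser_feat):   # both: (B, N, C)
        q = self.norm_r(restore_feat)
        kv = self.norm_p(parser_feat)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return restore_feat + out   # residual: parser info augments restoration

r = torch.randn(1, 256, 64)   # restoration tokens
p = torch.randn(1, 256, 64)   # parser tokens from a foundation model
print(ParserPromptedAttention(64)(r, p).shape)
```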
- Conference Article
- 1905
- 10.1109/cvpr52688.2022.01716
- Jun 1, 2022
In this paper, we present Uformer, an effective and efficient Transformer-based architecture for image restoration, in which we build a hierarchical encoder-decoder network using the Transformer block. Uformer has two core designs. First, we introduce a novel locally-enhanced window (LeWin) Transformer block, which performs non-overlapping window-based self-attention instead of global self-attention. It significantly reduces the computational complexity on high-resolution feature maps while capturing local context. Second, we propose a learnable multi-scale restoration modulator, in the form of a multi-scale spatial bias, to adjust features in multiple layers of the Uformer decoder. Our modulator demonstrates superior capability for restoring details across various image restoration tasks while introducing marginal extra parameters and computational cost. Powered by these two designs, Uformer enjoys a high capability for capturing both local and global dependencies for image restoration. To evaluate our approach, extensive experiments are conducted on several image restoration tasks, including image denoising, motion deblurring, defocus deblurring, and deraining. Without bells and whistles, our Uformer achieves superior or comparable performance compared with state-of-the-art algorithms. The code and models are available at https://github.com/ZhendongWang6/Uformer.
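The LeWin block's window-based self-attention can be sketched as follows: partition the feature map into non-overlapping w x w windows, attend within each window, and stitch the windows back together (a minimal sketch that omits Uformer's locally-enhanced feed-forward part):

```python
import torch
import torch.nn as nn

def window_attention(x, attn, w=8):
    """x: (B, C, H, W) with H, W divisible by w; attn: nn.MultiheadAttention."""
    b, c, h_, w_ = x.shape
    # (B, C, H/w, w, W/w, w) -> (B * num_windows, w*w, C)
    x = x.reshape(b, c, h_ // w, w, w_ // w, w).permute(0, 2, 4, 3, 5, 1)
    x = x.reshape(-1, w * w, c)
    x = x + attn(x, x, x, need_weights=False)[0]    # attention per window
    x = x.reshape(b, h_ // w, w_ // w, w, w, c).permute(0, 5, 1, 3, 2, 4)
    return x.reshape(b, c, h_, w_)

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
print(window_attention(torch.randn(1, 32, 64, 64), attn).shape)
```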
- Research Article
- 12
- 10.1609/aaai.v37i2.25235
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
By adopting the popular pixel-wise loss, existing methods for defocus deblurring rely heavily on well-aligned training image pairs. Although training pairs of ground-truth and blurry images are carefully collected, e.g., in the DPDD dataset, misalignment between training pairs is inevitable, so existing methods may suffer from deformation artifacts. In this paper, we propose a joint deblurring and reblurring learning (JDRL) framework for single image defocus deblurring with misaligned training pairs. Generally, JDRL consists of a deblurring module and a spatially invariant reblurring module, by which the deblurred result can be adaptively supervised by the ground-truth image to recover sharp textures while maintaining spatial consistency with the blurry image. First, in the deblurring module, a bi-directional optical-flow-based deformation is introduced to tolerate spatial misalignment between the deblurred and ground-truth images. Second, in the reblurring module, the deblurred result is reblurred to be spatially aligned with the blurry image by predicting a set of isotropic blur kernels and weighting maps. Moreover, we establish a new single image defocus deblurring (SDD) dataset, further validating our JDRL and benefiting future research. Our JDRL can be applied to boost defocus deblurring networks in terms of both quantitative metrics and visual quality on the DPDD, RealDOF, and our SDD datasets.
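The reblurring module's output can be sketched as a pixel-wise weighted sum of the deblurred image convolved with a bank of isotropic Gaussian kernels; the kernel bank and the source of the weighting maps are assumptions here, not the paper's predicted kernels:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma, size=11):
    ax = torch.arange(size, dtype=torch.float32) - size // 2
    k = torch.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def reblur(sharp, weight_maps, sigmas, size=11):
    """sharp: (B,3,H,W); weight_maps: (B,K,H,W), softmaxed over K."""
    out = torch.zeros_like(sharp)
    for k, sigma in enumerate(sigmas):
        kern = gaussian_kernel(sigma, size).view(1, 1, size, size)
        kern = kern.repeat(3, 1, 1, 1)                     # depthwise weights
        blurred = F.conv2d(sharp, kern, padding=size // 2, groups=3)
        out = out + weight_maps[:, k : k + 1] * blurred    # per-pixel mixing
    return out

sharp = torch.rand(1, 3, 64, 64)
w = torch.softmax(torch.randn(1, 4, 64, 64), dim=1)   # predicted in JDRL
print(reblur(sharp, w, sigmas=[0.5, 1.0, 2.0, 4.0]).shape)
```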
- Book Chapter
- 10.3233/faia241382
- Dec 13, 2024
- Frontiers in artificial intelligence and applications
The goal of single-image defocus deblurring is to reconstruct a clear image from a defocused one. Although existing methods perform well in common blurry scenes, they still struggle with feature extraction when encountering severely defocused areas. Therefore, an innovative progressive multi-scale feature extraction module (PMFEM) and a feature attention module (FAM) are proposed. The PMFEM gradually extracts and fuses multi-scale features through convolution kernels of different sizes: three parallel paths process the input separately, simplifying and optimizing the feature extraction process layer by layer to achieve progressive processing. The FAM dynamically enhances or suppresses features at specific scales by fusing multi-path feature information, thereby improving the model's feature expression ability. Experimental results show that our method achieves state-of-the-art performance on the Dual-Pixel Defocus Deblurring (DPDD) and Regression Tree Fields (RTF) datasets. Our code can be obtained at https://github.com/HMin-Z/PFANet.
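A rough sketch of the three-path multi-scale extraction idea, with kernel sizes and the residual fusion chosen for illustration rather than taken from the published PFANet layers:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # three parallel paths with different receptive fields
        self.paths = nn.ModuleList([
            nn.Conv2d(dim, dim, k, padding=k // 2) for k in (3, 5, 7)])
        self.fuse = nn.Conv2d(3 * dim, dim, 1)   # 1x1 fusion of all paths

    def forward(self, x):
        feats = [torch.relu(p(x)) for p in self.paths]
        return x + self.fuse(torch.cat(feats, dim=1))

print(MultiScaleBlock(16)(torch.randn(1, 16, 32, 32)).shape)
```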
- Research Article
- 5
- 10.1007/s40747-025-01789-w
- Feb 17, 2025
- Complex & Intelligent Systems
Single Image Defocus Deblurring (SIDD) remains challenging due to spatially varying blur kernels, particularly in processing high-resolution images where traditional methods often struggle with artifact generation, detail preservation, and computational efficiency. This paper presents Swin-Diff, a novel architecture integrating diffusion models with Transformer-based networks for robust defocus deblurring. Our approach employs a two-stage training strategy where a diffusion model generates prior information in a compact latent space, which is then hierarchically fused with intermediate features to guide the regression model. The architecture incorporates a dual-dimensional self-attention mechanism operating across channel and spatial domains, enhancing long-range modeling capabilities while maintaining linear computational complexity. Extensive experiments on three public datasets (DPDD, RealDOF, and RTF) demonstrate Swin-Diff’s superior performance, achieving average improvements of 1.37% in PSNR, 3.6% in SSIM, 2.3% in MAE, and 25.2% in LPIPS metrics compared to state-of-the-art methods. Our results validate the effectiveness of combining diffusion models with hierarchical attention mechanisms for high-quality defocus blur removal.
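The hierarchical fusion of a compact diffusion-generated latent prior into the regression features could take many forms; the sketch below assumes a simple FiLM-style scale-and-shift modulation, which is an illustrative guess rather than Swin-Diff's actual fusion:

```python
import torch
import torch.nn as nn

class PriorFusion(nn.Module):
    def __init__(self, feat_dim, prior_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(prior_dim, 2 * feat_dim)

    def forward(self, feat, prior):        # feat: (B,C,H,W); prior: (B,P)
        scale, shift = self.to_scale_shift(prior).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return feat * (1 + scale) + shift  # FiLM-style modulation

feat, prior = torch.randn(1, 32, 16, 16), torch.randn(1, 64)
print(PriorFusion(32, 64)(feat, prior).shape)
```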
- Research Article
- 12
- 10.1109/tpami.2024.3457856
- Dec 1, 2024
- IEEE transactions on pattern analysis and machine intelligence
This paper proposes an end-to-end deep learning approach for removing defocus blur from a single defocused image. Defocus blur is a common issue in digital photography that poses a challenge due to its spatially varying and large blurring effect. The proposed approach addresses this challenge by employing a pixel-wise Gaussian kernel mixture (GKM) model to accurately yet compactly parameterize spatially varying defocus point spread functions (PSFs), motivated by the isotropy of defocus PSFs. We further propose a grouped GKM (GGKM) model that decouples the coefficients in GKM, improving modeling accuracy in an economical manner. Afterward, a deep neural network called GGKMNet is developed by unrolling a fixed-point iteration of GGKM-based image deblurring, which avoids the efficiency issues of existing unrolling DNNs. Using a lightweight scale-recurrent architecture with a coarse-to-fine estimation scheme to predict the coefficients in GGKM, GGKMNet can efficiently recover an all-in-focus image from a defocused one. These advantages are demonstrated with extensive experiments on five benchmark datasets, where GGKMNet outperforms existing defocus deblurring methods in restoration quality while also showing advantages in model complexity and computational efficiency.
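Reading the abstract literally, the pixel-wise GKM writes the defocus PSF at pixel $p$ as a mixture of isotropic Gaussians with fixed bandwidths and spatially varying weights (notation reconstructed for illustration):

```latex
k_p = \sum_{i=1}^{M} w_i(p)\, G_{\sigma_i},
\qquad
G_{\sigma}(x) = \frac{1}{2\pi\sigma^2}\exp\!\Big(-\frac{\|x\|^2}{2\sigma^2}\Big),
```

so the network only needs to predict the weights $w_i(p)$; GGKM's grouping of these coefficients is a further refinement not captured here.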
- Research Article
- 5
- 10.1609/aaai.v38i6.28320
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
Defocus blur, due to spatially-varying sizes and shapes, is hard to remove. Existing methods either are unable to effectively handle irregular defocus blur or fail to generalize well on other datasets. In this work, we propose a divide-and-conquer approach to tackling this issue, which gives rise to a novel end-to-end deep learning method, called prior-and-prediction inverse kernel transformer (P2IKT), for single image defocus deblurring. Since most defocus blur can be approximated as Gaussian blur or its variants, we construct an inverse Gaussian kernel module in our method to enhance its generalization ability. At the same time, an inverse kernel prediction module is introduced in order to flexibly address the irregular blur that cannot be approximated by Gaussian blur. We further design a scale-recurrent transformer, which estimates mixing coefficients for adaptively combining the results from the two modules and runs the scale-recurrent "coarse-to-fine" procedure for progressive defocus deblurring. Extensive experimental results demonstrate that our P2IKT outperforms previous methods in terms of PSNR on multiple defocus deblurring datasets.
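An inverse Gaussian kernel can be illustrated with a regularized (Wiener-style) frequency-domain inverse of a Gaussian blur; the regularization constant and kernel handling below are illustrative assumptions, not P2IKT's learned module:

```python
import torch

def gaussian_psf(sigma, h, w):
    ys = torch.arange(h, dtype=torch.float32) - h // 2
    xs = torch.arange(w, dtype=torch.float32) - w // 2
    k = torch.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2 * sigma ** 2))
    return torch.fft.ifftshift(k / k.sum())      # center the kernel at (0, 0)

def inverse_gaussian_filter(img, sigma, eps=1e-2):
    """img: (H, W); undo an assumed Gaussian blur of bandwidth sigma."""
    H = torch.fft.fft2(gaussian_psf(sigma, *img.shape))
    inv_k = H.conj() / (H.abs() ** 2 + eps)      # regularized inverse kernel
    return torch.fft.ifft2(torch.fft.fft2(img) * inv_k).real

print(inverse_gaussian_filter(torch.rand(64, 64), sigma=2.0).shape)
```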
- Conference Article
- 155
- 10.1109/cvpr46437.2021.00207
- Jun 1, 2021
We propose a novel end-to-end learning-based approach for single image defocus deblurring. The proposed approach is equipped with a novel Iterative Filter Adaptive Network (IFAN) that is specifically designed to handle spatially-varying and large defocus blur. For adaptively handling spatially-varying blur, IFAN predicts pixel-wise deblurring filters, which are applied to defocused features of an input image to generate deblurred features. For effectively managing large blur, IFAN models deblurring filters as stacks of small-sized separable filters. Predicted separable deblurring filters are applied to defocused features using a novel Iterative Adaptive Convolution (IAC) layer. We also propose a training scheme based on defocus disparity estimation and reblurring, which significantly boosts the deblurring quality. We demonstrate that our method achieves state-of-the-art performance both quantitatively and qualitatively on real-world images.
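The separable pixel-wise filtering at the heart of the IAC layer can be sketched with `unfold`: every pixel gets its own 1D horizontal and vertical filter taps, applied in sequence. Filter prediction is faked with random tensors here; the real filters come from IFAN's prediction network:

```python
import torch
import torch.nn.functional as F

def adaptive_sep_conv(feat, fh, fv, k=5):
    """feat: (B,C,H,W); fh, fv: (B,k,H,W) per-pixel 1D filter taps."""
    b, c, h, w = feat.shape
    # horizontal pass: gather the k horizontal neighbors of every pixel
    cols = F.unfold(feat, (1, k), padding=(0, k // 2))       # (B, C*k, H*W)
    cols = cols.view(b, c, k, h, w)
    feat = (cols * fh.unsqueeze(1)).sum(2)                   # weighted sum
    # vertical pass
    cols = F.unfold(feat, (k, 1), padding=(k // 2, 0))
    cols = cols.view(b, c, k, h, w)
    return (cols * fv.unsqueeze(1)).sum(2)

feat = torch.randn(1, 8, 32, 32)
fh = torch.softmax(torch.randn(1, 5, 32, 32), dim=1)   # predicted in IFAN
fv = torch.softmax(torch.randn(1, 5, 32, 32), dim=1)
print(adaptive_sep_conv(feat, fh, fv).shape)           # (1, 8, 32, 32)
```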
- Conference Article
- 32
- 10.1109/cvpr52729.2023.00557
- Jun 1, 2023
Single image defocus deblurring (SIDD) refers to recovering an all-in-focus image from a defocused blurry one. It is a challenging recovery task due to the spatially varying defocus blurring effects with significant size variation. Motivated by the strong correlation among defocus kernels of different sizes and the blob-type structure of defocus kernels, we propose a learnable recursive kernel representation (RKR) that expresses a defocus kernel as a linear combination of recursive, separable, and positive atom kernels, leading to a compact yet effective, physics-encoded parameterization of the spatially varying defocus blurring process. Afterwards, a physics-driven and efficient deep model with a cross-scale fusion structure is presented for SIDD, inspired by the truncated Neumann series for approximating the matrix inversion of the RKR-based blurring operator. In addition, a reblurring loss is proposed to regularize the RKR learning. Extensive experiments show that our proposed approach significantly outperforms existing ones, with a model size comparable to that of the top methods.
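The recursion behind RKR can be illustrated as repeatedly convolving a small positive atom kernel with itself, so each level's kernel grows in support; a defocus kernel would then be a positive linear combination of these levels (the combination weights are omitted in this sketch):

```python
import torch
import torch.nn.functional as F

def recursive_kernels(atom, levels):
    """atom: (1,1,a,a) positive kernel; returns kernels of growing support."""
    ks = [atom]
    for _ in range(levels - 1):
        pad = atom.shape[-1] - 1        # pad so the convolution is "full"
        ks.append(F.conv2d(F.pad(ks[-1], (pad,) * 4), atom))
    return ks

atom = torch.softmax(torch.randn(9), dim=0).view(1, 1, 3, 3)  # positive atom
ks = recursive_kernels(atom, levels=4)
print([k.shape[-1] for k in ks])        # [3, 5, 7, 9]: growing support
```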
- Research Article
- 10.1007/s11263-025-02522-3
- Jul 5, 2025
- International Journal of Computer Vision
Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs