Thanks to the powerful modeling capabilities of multi-head self-attention (MSA), transformers have shown significant performance gains in vision tasks. However, since transformers demand heavy computation, more efficient designs are needed. In this paper, we present an efficient transformer architecture named GoLDFormer for image restoration. GoLDFormer extends the capability of window-based self-attention through two core designs. First, we propose a globally-enhanced window-based transformer block (G-WTB), which applies transposed attention to a compressed window representation rather than the spatial features, thus establishing connections between all windows at lower computational complexity. Second, since the interactions between image content and window attention weights can be interpreted as spatially varying convolution, we introduce an adaptive filter structure into transformer models and propose a deformable filtering block (DFB) to enable cross-window connections. By adjusting the shape of the generated filters in the DFB, we can balance the computational cost against the degree of adjacent-window interaction. Extensive experiments on several image restoration tasks demonstrate that GoLDFormer achieves results competitive with recent methods at lower computational cost.
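To make the first design concrete, below is a minimal PyTorch sketch of the global branch described in the abstract: each window is pooled into a compact descriptor, and transposed (channel-wise) attention is applied over the compressed window map so that all windows exchange information at low cost. The class name, module structure, and hyperparameters (`dim`, `window_size`, the learnable `temperature`, average pooling as the compression, and nearest-neighbor upsampling back to full resolution) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalWindowTransposedAttention(nn.Module):
    """Hypothetical sketch of the G-WTB global branch: each window is
    compressed to one descriptor, then transposed (channel-wise)
    attention mixes the compressed window map globally."""

    def __init__(self, dim, window_size=8):
        super().__init__()
        self.window_size = window_size
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        ws = self.window_size                    # assumes H, W divisible by ws
        # Compress each ws x ws window into one descriptor: (B, C, H/ws, W/ws).
        pooled = F.avg_pool2d(x, ws)
        q, k, v = self.qkv(pooled).chunk(3, dim=1)
        # Flatten spatial dims; transposed attention attends across channels,
        # so the attention map is C x C instead of quadratic in pixel count.
        q = F.normalize(q.flatten(2), dim=-1)    # (B, C, N), N = (H/ws)*(W/ws)
        k = F.normalize(k.flatten(2), dim=-1)
        v = v.flatten(2)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # (B, C, C)
        out = attn.softmax(dim=-1) @ v           # (B, C, N)
        out = self.proj(out.view(B, C, H // ws, W // ws))
        # Broadcast the globally mixed window descriptors back to full size.
        return F.interpolate(out, size=(H, W), mode='nearest')
```

Because attention runs on the pooled window grid and across channels, the cost scales with the number of windows and channels rather than the number of pixels, which is the efficiency argument the abstract makes for the G-WTB.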