Transformer-based methods effectively capture global dependencies in images and have demonstrated outstanding performance in multiple visual tasks. However, existing Transformers cannot effectively denoise large noisy images captured under low-light conditions because (1) the global self-attention mechanism incurs high computational complexity in the spatial dimension, as its cost grows quadratically with the number of tokens, and (2) channel-wise self-attention cannot exploit the spatial correlations in images. We propose a local–global interaction Transformer (LGIT) that employs an adaptive strategy to select relevant patches for global interaction, achieving low computational complexity in global self-attention computation. A top-N patch cross-attention model (TPCA) is designed based on superpixel segmentation guidance: TPCA selects the top-N patches most similar to the target image patch and applies cross-attention to aggregate their information into the target patch, effectively enhancing the utilisation of the image's nonlocal self-similarity. A mixed-scale dual-gated feedforward network (MDGFF) is introduced for the effective extraction of multiscale local correlations. TPCA and MDGFF are combined to construct LGIT, a hierarchical encoder-decoder network that computes self-attention within and across patches at different scales. Extensive experiments on real-world image-denoising datasets demonstrated that LGIT outperformed state-of-the-art (SOTA) convolutional neural network (CNN) and Transformer-based methods in both qualitative and quantitative results.
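To make the TPCA idea concrete, below is a minimal PyTorch sketch of top-N patch selection followed by cross-attention. It is illustrative only, not the authors' implementation: the function name, the cosine-similarity ranking over mean-pooled patch descriptors, and the single-head unprojected attention are all assumptions, and the paper's superpixel-guided candidate generation and learned query/key/value projections are omitted.

```python
import torch
import torch.nn.functional as F

def top_n_patch_cross_attention(target, candidates, n=4):
    """Aggregate information from the n candidate patches most similar
    to the target patch via single-head cross-attention (hypothetical
    simplification of TPCA).

    target:     (B, L, C)    tokens of the target patch
    candidates: (B, P, L, C) tokens of P candidate patches
    """
    B, P, L, C = candidates.shape
    # Rank candidate patches by cosine similarity of mean-pooled descriptors.
    t_desc = F.normalize(target.mean(dim=1), dim=-1)       # (B, C)
    c_desc = F.normalize(candidates.mean(dim=2), dim=-1)   # (B, P, C)
    sim = torch.einsum("bc,bpc->bp", t_desc, c_desc)       # (B, P)
    idx = sim.topk(n, dim=1).indices                       # (B, n)
    # Gather the top-n patches and flatten them into one key/value set.
    sel = torch.gather(
        candidates, 1, idx[:, :, None, None].expand(-1, -1, L, C)
    ).reshape(B, n * L, C)
    # Cross-attention: target tokens query the selected patch tokens.
    attn = torch.softmax(
        target @ sel.transpose(1, 2) / C ** 0.5, dim=-1
    )                                                      # (B, L, n*L)
    return target + attn @ sel                             # residual update
```

Restricting the key/value set to the n most similar patches is what keeps the attention cost linear in the number of candidates rather than quadratic in the total number of image tokens, which is the complexity advantage the abstract claims for LGIT.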